Graph Databases

Second Edition

Ian Robinson, Jim Webber & Emil Eifrem

Graph Databases

by Ian Robinson, Jim Webber, and Emil Eifrem

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Editor: Marie Beaugureau
  • Production Editor: Kristen Brown
  • Proofreader: Christina Edwards
  • Indexer: WordCo Indexing Services
  • Interior Designer: David Futato
  • Cover Designer: Ellie Volckhausen
  • Illustrator: Rebecca Demarest
  • June 2013: First Edition
  • June 2015: Second Edition

Revision History for the Second Edition

  • 2015-06-09: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491930892 for release details.

Foreword

Graphs Are Eating The World, And There’s No Going Back

In the three years since we first wrote Graph Databases, our industry has witnessed a fundamental shift in the way in which it views its data assets.

Data, always present in some stratum of innovation, has for several decades delivered only a fraction of its potential, in large part because the technologies at our disposal have forced us to treat it as though it were nothing but isolated islands of middling significance. Graphs and graph databases change this completely.

As vertical after vertical discovers the transformative power of connected data, the breakaway leaders in these industries are stealing an irreversible march on their competitors. Graphs are everywhere, they’re eating the world, and there’s no going back.

As I wrote in my foreword to the first edition, this change in perspective started almost two decades ago, when a precocious web search startup challenged the dominance of the market leaders — AltaVista, Lycos, Excite, et al — through its application of a simple algorithm that made sense of the way in which web documents are connected.

Today, Google dominates the web search space. In its wake, other industry leaders have asked themselves: “What if we took the relationships and connections in our data and reimagined our business along those relationships? What would that look like?” The answers to these questions are omnipresent in our online lives today in the form of Facebook, Twitter, and the like.

What was once a specialist and often proprietary means for realizing the opportunities inherent in connected data is now a commodity technology. In the past three years the features, usability, and performance of the world’s leading graph database have matured enormously; awareness and adoption have penetrated far wider, deeper, and more quickly than we could have hoped; and the inventiveness and irreversible impact of introducing graph databases into formerly discrete-data-oriented domains have invigorated and challenged the markets at every turn.

In 2011, we thought the main verticals to adopt graph databases would be software, financial services, and telecom; and largely we were right. However, what’s been even more amazing has been the adoption of graph databases outside of those top three verticals.

We’ve seen industry after industry being eaten by graphs. In each case, the adoption of graph technology has resulted in better products and more remarkable customer experiences. Companies such as Pitney Bowes, eBay, and Cisco are deploying the graph to solve some of their most mission-critical problems, forcing their competition to catch up or leave the industry. Four of the top ten global retailers today use Neo4j. Behind them, competitors that have failed to adapt are struggling to keep up.

This ability of graph databases to colonize and radically transform an industry is nowhere more apparent than in the emerging Internet of Things (IoT), a domain which might more aptly be called the Internet of Connected Things, because without the connections, there’s no point to it. When you have a lot of connected things, you have a graph-based problem.

In recent years, a major telco equipment provider has entered the IoT space with a product that, embedded inside large telecom networks, sniffs network traffic and builds a model of all the connected devices on the network. If devices in one category are all flashing red at the same time, you can easily determine if it’s truly because all of them are simultaneously failing or if it’s because they’re all connected to a firewall and power supply that has just gone out. That level of real-time, predictive analysis is what you can do when taking a connected view of the IoT.

The speed with which such solutions can be developed and put into production is a result of some significant changes to the underlying graph database technology. In 2013 we introduced Neo4j 2.0, marking a big change in the features, usability, and performance of the product. Besides a wholly new visualization tool, Neo4j 2.0 came with an improved data model, whose chief features (labels, optional constraints, and declarative indexes), coupled with numerous improvements to the Cypher query language, make designing and developing a graph database application easier and more intuitive than ever before.

Accompanying this maturation of the technology is an amazing growth in community traction. According to db-engines.com, graph databases have been the fastest growing database category since 2013. Big data is the hottest growing sector in the tech industry, and graph databases are at the absolute nexus of that growth. Graphs are indeed eating the world, and there’s no turning back.

I hope this new edition of Graph Databases will serve as a great update (or starting point) to the growing world of graph technologies, and I hope it will inspire you to start using a graph database in your next project, or to apply the technology in even more amazing ways if you’ve already taken the leap into the graph.

Preface

Graph databases address one of the great macroscopic business trends of today: leveraging complex and dynamic relationships in highly connected data to generate insight and competitive advantage. Whether we want to understand relationships between customers, elements in a telephone or data center network, entertainment producers and consumers, or genes and proteins, the ability to understand and analyze vast graphs of highly connected data will be key in determining which companies outperform their competitors over the coming decade.

For data of any significant size or value, graph databases are the best way to represent and query connected data. Connected data is data whose interpretation and value requires us first to understand the ways in which its constituent elements are related. More often than not, to generate this understanding, we need to name and qualify the connections between things.

Although large corporations realized this some time ago and began creating their own proprietary graph processing technologies, we’re now in an era where that technology has rapidly become democratized. Today, general-purpose graph databases are a reality, enabling mainstream users to experience the benefits of connected data without having to invest in building their own graph infrastructure.

What’s remarkable about this renaissance of graph data and graph thinking is that graph theory itself is not new. Graph theory was pioneered by Euler in the 18th century, and has been actively researched and improved by mathematicians, sociologists, anthropologists, and other practitioners ever since. However, it is only in the past few years that graph theory and graph thinking have been applied to information management. In that time, graph databases have helped solve important problems in the areas of social networking, master data management, geospatial, recommendations, and more. This increased focus on graph databases is driven by two forces: by the massive commercial success of companies such as Facebook, Google, and Twitter, all of whom have centered their business models around their own proprietary graph technologies; and by the introduction of general-purpose graph databases into the technology landscape.

About the Second Edition

The first edition of this book was written while Neo4j 2.0 was under active development, when the final forms of labels, indexes, and constraints were still to be fixed. Now that Neo4j is well into its 2.x lifecycle (2.2 at the time of writing, with 2.3 coming soon), we can confidently incorporate the new elements of the graph property model into the text.

For the second edition of this book, we’ve revised all the Cypher examples to bring them in line with the latest Cypher syntax. We’ve added labels both to the queries and the diagrams, and have provided explanations of Cypher’s declarative indexing and optional constraints. Elsewhere, we’ve added additional modeling guidelines, brought the description of Neo4j’s internals up to date with the changes to its internal architecture, and updated the testing examples to use the latest tooling.

About This Book

The purpose of this book is to introduce graphs and graph databases to technology practitioners, including developers, database professionals, and technology decision makers. Reading this book will give you a practical understanding of graph databases. We show how the graph model “shapes” data, and how we query, reason about, understand, and act upon data using a graph database. We discuss the kinds of problems that are well aligned with graph databases, with examples drawn from real-world use cases, and we show how to plan and implement a graph database solution.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This icon signifies a tip, suggestion, or general note.



Warning

This icon indicates a warning or caution.


Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/iansrobinson/graph-databases-use-cases.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem (O’Reilly). Copyright 2015 Neo Technology, Inc., 978-1-491-93089-2.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

Safari® Books Online


Note

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.


Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals.

Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/graph-databases-2e.

To comment or ask technical questions about this book, send email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

We would like to thank our technical reviewers: Michael Hunger, Colin Jack, Mark Needham, and Pramod Sadalage.

Our appreciation and thanks to our editor for the first edition, Nathan Jepson.

Our colleagues at Neo Technology have contributed enormously of their time, experience, and effort throughout the writing of this book. Thanks in particular go to Anders Nawroth, for his invaluable assistance with our book’s toolchain; Andrés Taylor, for his enthusiastic help with all things Cypher; and Philip Rathle, for his advice and contributions to the text.

A big thank you to everyone in the Neo4j community for your many contributions to the graph database space over the years.

And special thanks to our families, for their love and support: Lottie, Tiger, Elliot, Kath, Billy, Madelene, and Noomi.

This second edition was made possible by the diligent work of Cristina Escalante and Michael Hunger. Thank you to both of you for your invaluable help.

Chapter 1. Introduction

Although much of this book talks about graph data models, it is not a book about graph theory.1 We don’t need much theory to take advantage of graph databases: provided we understand what a graph is, we’re practically there. With that in mind, let’s refresh our memories about graphs in general.

What Is a Graph?

Formally, a graph is just a collection of vertices and edges — or, in less intimidating language, a set of nodes and the relationships that connect them. Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships. This general-purpose, expressive structure allows us to model all kinds of scenarios, from the construction of a space rocket, to a system of roads, and from the supply-chain or provenance of foodstuff, to medical history for populations, and beyond.

For example, Twitter’s data is easily represented as a graph. In Figure 1-1 we see a small network of Twitter users. Each node is labeled User, indicating its role in the network. These nodes are then connected with relationships, which help further establish the semantic context: namely, that Billy follows Harry, and that Harry, in turn, follows Billy. Ruth and Harry likewise follow each other, but sadly, although Ruth follows Billy, Billy hasn’t (yet) reciprocated.
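The small network in Figure 1-1 can be sketched in a few lines of plain Python. This is an illustrative model only, not Neo4j code: nodes carry a label, and FOLLOWS relationships are directed references between nodes.

```python
# Illustrative model of the Figure 1-1 network (not Neo4j code):
# nodes carry a label; FOLLOWS relationships are directed edges.

class Node:
    def __init__(self, label, name):
        self.label = label          # e.g., "User"
        self.name = name
        self.follows = set()        # outgoing FOLLOWS relationships

    def follow(self, other):
        self.follows.add(other)

billy, harry, ruth = (Node("User", n) for n in ("Billy", "Harry", "Ruth"))

# Billy and Harry follow each other; so do Ruth and Harry.
billy.follow(harry); harry.follow(billy)
ruth.follow(harry);  harry.follow(ruth)
# Ruth follows Billy, but Billy hasn't (yet) reciprocated.
ruth.follow(billy)

print(billy in ruth.follows)   # True
print(ruth in billy.follows)   # False
```

Because each node holds its own outgoing relationships, asking "whom does Ruth follow?" is a direct lookup on Ruth's node, which foreshadows the traversal-centric queries discussed later in the chapter.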

Figure 1-1. A small social graph

Of course, Twitter’s real graph is hundreds of millions of times larger than the example in Figure 1-1, but it works on precisely the same principles. In Figure 1-2 we’ve expanded the graph to include the messages published by Ruth.

Figure 1-2. Publishing messages

Though simple, Figure 1-2 shows the expressive power of the graph model. It’s easy to see that Ruth has published a string of messages. Her most recent message can be found by following a relationship marked CURRENT. The PREVIOUS relationships then create Ruth’s timeline.
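Ruth's timeline can be modeled in the same illustrative style (again, a sketch rather than Neo4j code, with message texts invented for the example): a CURRENT reference points at her latest message, and each message's PREVIOUS reference points at the one published before it.

```python
# Illustrative sketch of a timeline like Figure 1-2 (not Neo4j code).
# CURRENT points at the latest message; PREVIOUS chains messages back
# in time. The message texts are invented for this example.

class Message:
    def __init__(self, text, previous=None):
        self.text = text
        self.previous = previous    # PREVIOUS relationship (or None)

first  = Message("graphs are everywhere")
second = Message("graphs are eating the world", previous=first)
third  = Message("there's no going back", previous=second)

current = third                     # the CURRENT relationship

def timeline(current):
    """Walk PREVIOUS relationships from the current message backwards."""
    messages, msg = [], current
    while msg is not None:
        messages.append(msg.text)
        msg = msg.previous
    return messages

print(timeline(current))
# ["there's no going back", 'graphs are eating the world', 'graphs are everywhere']
```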

A High-Level View of the Graph Space

Numerous projects and products for managing, processing, and analyzing graphs have exploded onto the scene in recent years. The sheer number of technologies makes it difficult to keep track of these tools and how they differ, even for those of us who are active in the space. This section provides a high-level framework for making sense of the emerging graph landscape.

From 10,000 feet, we can divide the graph space into two parts:

Technologies used primarily for transactional online graph persistence, typically accessed directly in real time from an application
These technologies are called graph databases and are the main focus of this book. They are the equivalent of “normal” online transactional processing (OLTP) databases in the relational world.
Technologies used primarily for offline graph analytics, typically performed as a series of batch steps
These technologies can be called graph compute engines. They can be thought of as being in the same category as other technologies for analysis of data in bulk, such as data mining and online analytical processing (OLAP).

Note

Another way to slice the graph space is to look at the graph models employed by the various technologies. There are three dominant graph data models: the property graph, Resource Description Framework (RDF) triples, and hypergraphs. We describe these in detail in Appendix A. Most of the popular graph databases on the market use a variant of the property graph model, and consequently, it’s the model we’ll use throughout the remainder of this book.


Graph Databases

A graph database management system (henceforth, a graph database) is an online database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. Graph databases are generally built for use with transactional (OLTP) systems. Accordingly, they are normally optimized for transactional performance, and engineered with transactional integrity and operational availability in mind.

There are two properties of graph databases we should consider when investigating graph database technologies:

The underlying storage
Some graph databases use native graph storage that is optimized and designed for storing and managing graphs. Not all graph database technologies use native graph storage, however. Some serialize the graph data into a relational database, an object-oriented database, or some other general-purpose data store.
The processing engine
Some definitions require that a graph database use index-free adjacency, meaning that connected nodes physically “point” to each other in the database.2 Here we take a slightly broader view: any database that from the user’s perspective behaves like a graph database (i.e., exposes a graph data model through CRUD operations) qualifies as a graph database. We do acknowledge, however, the significant performance advantages of index-free adjacency, and therefore use the term native graph processing to describe graph databases that leverage index-free adjacency.

Note

It’s important to note that native graph storage and native graph processing are neither good nor bad — they’re simply classic engineering trade-offs. The benefit of native graph storage is that its purpose-built stack is engineered for performance and scalability. The benefit of nonnative graph storage, in contrast, is that it typically depends on a mature nongraph backend (such as MySQL) whose production characteristics are well understood by operations teams. Native graph processing (index-free adjacency) benefits traversal performance, but at the expense of making some queries that don’t use traversals difficult or memory intensive.
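The contrast between index-backed lookup and index-free adjacency can be illustrated with a toy sketch. This is not how any particular database is implemented; it only shows the shape of the trade-off: with a global edge table, finding a node's neighbors means searching that table, whereas with index-free adjacency each node record "points" directly at its neighbors.

```python
# Toy illustration of the index-free adjacency idea (not any database's
# actual internals).

# Index-backed: edges live in a global table; finding the neighbors of a
# node means searching or indexing into that table, whose size grows with
# the whole graph.
edge_table = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]

def neighbors_via_index(node):
    return [dst for src, dst in edge_table if src == node]

# Index-free: each node record holds direct references to its neighbors,
# so one traversal step is a direct reference chase.
adjacency = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}

def neighbors_index_free(node):
    return adjacency[node]

print(neighbors_via_index("A"))    # ['B', 'C']
print(neighbors_index_free("A"))   # ['B', 'C']
```

Both calls return the same answer; the difference is where the cost lands, which is exactly the engineering trade-off described above.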


Relationships are first-class citizens of the graph data model. This is not the case in other database management systems, where we have to infer connections between entities using things like foreign keys or out-of-band processing such as map-reduce. By assembling the simple abstractions of nodes and relationships into connected structures, graph databases enable us to build arbitrarily sophisticated models that map closely to our problem domain. The resulting models are simpler and at the same time more expressive than those produced using traditional relational databases and the other NOSQL (Not Only SQL) stores.

Figure 1-3 shows a pictorial overview of some of the graph databases on the market today, based on their storage and processing models.

Figure 1-3. An overview of the graph database space

Graph Compute Engines

A graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets. Graph compute engines are designed to do things like identify clusters in your data, or answer questions such as, “how many relationships, on average, does everyone in a social network have?”

Because of their emphasis on global queries, graph compute engines are normally optimized for scanning and processing large amounts of information in batches, and in that respect they are similar to other batch analysis technologies, such as data mining and OLAP, in use in the relational world. Whereas some graph compute engines include a graph storage layer, others (and arguably most) concern themselves strictly with processing data that is fed in from an external source, and then returning the results for storage elsewhere.

Figure 1-4 shows a common architecture for deploying a graph compute engine. The architecture includes a system of record (SOR) database with OLTP properties (such as MySQL, Oracle, or Neo4j), which services requests and responds to queries from the application (and ultimately the users) at runtime. Periodically, an Extract, Transform, and Load (ETL) job moves data from the system of record database into the graph compute engine for offline querying and analysis.

Figure 1-4. A high-level view of a typical graph compute engine deployment

A variety of different types of graph compute engines exist. Most notably there are in-memory/single machine graph compute engines like Cassovary and distributed graph compute engines like Pegasus or Giraph. Most distributed graph compute engines are based on the Pregel white paper, authored by Google, which describes the graph compute engine Google uses to rank pages.

The Power of Graph Databases

Notwithstanding the fact that just about anything can be modeled as a graph, we live in a pragmatic world of budgets, project time lines, corporate standards, and commoditized skillsets. That a graph database provides a powerful but novel data modeling technique does not in itself provide sufficient justification for replacing a well-established, well-understood data platform; there must also be an immediate and very significant practical benefit. In the case of graph databases, this motivation exists in the form of a set of use cases and data patterns whose performance improves by one or more orders of magnitude when implemented in a graph, and whose latency is much lower compared to batch processing of aggregates. On top of this performance benefit, graph databases offer an extremely flexible data model, and a mode of delivery aligned with today’s agile software delivery practices.

Performance

One compelling reason, then, for choosing a graph database is the sheer performance increase when dealing with connected data versus relational databases and NOSQL stores. In contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, with a graph database performance tends to remain relatively constant, even as the dataset grows. This is because queries are localized to a portion of the graph. As a result, the execution time for each query is proportional only to the size of the part of the graph traversed to satisfy that query, rather than the size of the overall graph.
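This locality can be made concrete with a small sketch (illustrative only, with made-up node names): a depth-bounded traversal does work proportional to the neighborhood it explores, regardless of how many unrelated nodes the rest of the graph contains.

```python
from collections import deque

# Illustrative sketch: a depth-bounded traversal's work depends on the
# size of the neighborhood explored, not the size of the whole graph.

def neighborhood(adjacency, start, depth):
    """Breadth-first traversal from `start`, at most `depth` hops out."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue                 # don't expand beyond the hop limit
        for nbr in adjacency.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, d + 1))
    return seen

# A small connected cluster plus thousands of unrelated nodes.
graph = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": [], "dave": []}
graph.update({f"other{i}": [] for i in range(10_000)})

# A friends-of-friends query for alice touches only her 2-hop neighborhood.
print(sorted(neighborhood(graph, "alice", 2)))
# ['alice', 'bob', 'carol', 'dave']
```

The 10,000 extra nodes never enter the traversal, which is the intuition behind the relatively constant query times described above.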

Flexibility

As developers and data architects, we want to connect data as the domain dictates, thereby allowing structure and schema to emerge in tandem with our growing understanding of the problem space, rather than being imposed upfront, when we know least about the real shape and intricacies of the data. Graph databases address this want directly. As we show in Chapter 3, the graph data model expresses and accommodates business needs in a way that enables IT to move at the speed of business.

Graphs are naturally additive, meaning we can add new kinds of relationships, new nodes, new labels, and new subgraphs to an existing structure without disturbing existing queries and application functionality. These things have generally positive implications for developer productivity and project risk. Because of the graph model’s flexibility, we don’t have to model our domain in exhaustive detail ahead of time — a practice that is all but foolhardy in the face of changing business requirements. The additive nature of graphs also means we tend to perform fewer migrations, thereby reducing maintenance overhead and risk.

Agility

We want to be able to evolve our data model in step with the rest of our application, using a technology aligned with today’s incremental and iterative software delivery practices. Modern graph databases equip us to perform frictionless development and graceful systems maintenance. In particular, the schema-free nature of the graph data model, coupled with the testable nature of a graph database’s application programming interface (API) and query language, empower us to evolve an application in a controlled manner.

At the same time, precisely because they are schema free, graph databases lack the kind of schema-oriented data governance mechanisms we’re familiar with in the relational world. But this is not a risk; rather, it calls forth a far more visible and actionable kind of governance. As we show in Chapter 4, governance is typically applied in a programmatic fashion, using tests to drive out the data model and queries, as well as assert the business rules that depend upon the graph. This is no longer a controversial practice: more so than relational development, graph database development aligns well with today’s agile and test-driven software development practices, allowing graph database–backed applications to evolve in step with changing business environments.

Summary

In this chapter we’ve reviewed the graph property model, a simple yet expressive tool for representing connected data. Property graphs capture complex domains in an expressive and flexible fashion, while graph databases make it easy to develop applications that manipulate our graph models.

In the next chapter we’ll look in more detail at how several different technologies address the challenge of connected data, starting with relational databases, moving onto aggregate NOSQL stores, and ending with graph databases. In the course of the discussion, we’ll see why graphs and graph databases provide the best means for modeling, storing, and querying connected data. Later chapters then go on to show how to design and implement a graph database–based solution.

1 For introductions to graph theory, see Richard J. Trudeau, Introduction To Graph Theory (Dover, 1993) and Gary Chartrand, Introductory Graph Theory (Dover, 1985). For an excellent introduction to how graphs provide insight into complex events and behaviors, see David Easley and Jon Kleinberg, Networks, Crowds, and Markets: Reasoning about a Highly Connected World (Cambridge University Press, 2010).

2 See Rodriguez, Marko A., and Peter Neubauer. 2011. “The Graph Traversal Pattern.” In Graph Data Management: Techniques and Applications, ed. Sherif Sakr and Eric Pardede, 29-46. Hershey, PA: IGI Global.

Chapter 2. Options for Storing Connected Data

We live in a connected world. To thrive and progress, we need to understand and influence the web of connections that surrounds us.

How do today’s technologies deal with the challenge of connected data? In this chapter we look at how relational databases and aggregate NOSQL stores manage graphs and connected data, and compare their performance to that of a graph database. For readers interested in exploring the topic of NOSQL, Appendix A describes the four major types of NOSQL databases.

Relational Databases Lack Relationships

For several decades, developers have tried to accommodate connected, semi-structured datasets inside relational databases. But whereas relational databases were initially designed to codify paper forms and tabular structures — something they do exceedingly well — they struggle when attempting to model the ad hoc, exceptional relationships that crop up in the real world. Ironically, relational databases deal poorly with relationships.

Relationships do exist in the vernacular of relational databases, but only at modeling time, as a means of joining tables. In our discussion of connected data in the previous chapter, we mentioned we often need to disambiguate the semantics of the relationships that connect entities, as well as qualify their weight or strength. Relational relations do nothing of the sort. Worse still, as outlier data multiplies, and the overall structure of the dataset becomes more complex and less uniform, the relational model becomes burdened with large join tables, sparsely populated rows, and lots of null-checking logic. The rise in connectedness translates in the relational world into increased joins, which impede performance and make it difficult for us to evolve an existing database in response to changing business needs.

Figure 2-1 shows a relational schema for storing customer orders in a customer-centric, transactional application.

Figure 2-1. Semantic relationships are hidden in a relational database

The application exerts a tremendous influence over the design of this schema, making some queries very easy, and others more difficult:

  • Join tables add accidental complexity; they mix business data with foreign key metadata.
  • Foreign key constraints add additional development and maintenance overhead just to make the database work.
  • Sparse tables with nullable columns require special checking in code, despite the presence of a schema.
  • Several expensive joins are needed just to discover what a customer bought.
  • Reciprocal queries are even more costly. “What products did a customer buy?” is relatively cheap compared to “which customers bought this product?”, which is the basis of recommendation systems. We could introduce an index, but even with an index, recursive questions such as “which customers buying this product also bought that product?” quickly become prohibitively expensive as the degree of recursion increases.

Relational databases struggle with highly connected domains. To understand the cost of performing connected queries in a relational database, we’ll look at some simple and not-so-simple queries in a social network domain.

Figure 2-2 shows a simple join-table arrangement for recording friendships.

Figure 2-2. Modeling friends and friends-of-friends in a relational database

Asking “who are Bob’s friends?” is easy, as shown in Example 2-1.

Example 2-1. Bob's friends
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.FriendID = p1.ID
JOIN Person p2
  ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'

Based on our sample data, the answer is Alice and Zach. This isn’t a particularly expensive or difficult query, because it constrains the number of rows under consideration using the filter WHERE Person.person='Bob'.

Friendship isn’t always a reflexive relationship, so in Example 2-2, we ask the reciprocal query, which is, “who is friends with Bob?”

Example 2-2. Who is friends with Bob?
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
  ON PersonFriend.PersonID = p1.ID
JOIN Person p2
  ON PersonFriend.FriendID = p2.ID
WHERE p2.Person = 'Bob'

The answer to this query is Alice; sadly, Zach doesn’t consider Bob to be a friend. This reciprocal query is still easy to implement, but on the database side it’s more expensive, because the database now has to consider all the rows in the PersonFriend table.

We can add an index, but this still involves an expensive layer of indirection. Things become even more problematic when we ask, “who are the friends of my friends?” Hierarchies in SQL use recursive joins, which make the query syntactically and computationally more complex, as shown in Example 2-3. (Some relational databases provide syntactic sugar for this — for instance, Oracle has a CONNECT BY function — which simplifies the query, but not the underlying computational complexity.)

Example 2-3. Alice's friends-of-friends
SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
FROM PersonFriend pf1 JOIN Person p1
  ON pf1.PersonID = p1.ID
JOIN PersonFriend pf2
  ON pf2.PersonID = pf1.FriendID
JOIN Person p2
  ON pf2.FriendID = p2.ID
WHERE p1.Person = 'Alice' AND pf2.FriendID <> p1.ID

This query is computationally complex, even though it only deals with the friends of Alice’s friends, and goes no deeper into Alice’s social network. Things get more complex and more expensive the deeper we go into the network. Though it’s possible to get an answer to the question “who are my friends-of-friends-of-friends?” in a reasonable period of time, queries that extend to four, five, or six degrees of friendship deteriorate significantly due to the computational and space complexity of recursively joining tables.
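To make this growth concrete, here is a minimal sketch of a depth-bounded friends-of-friends query written as a recursive common table expression against an in-memory SQLite database. The table names mirror Examples 2-1 through 2-3, but the sample data is invented for illustration:

```python
import sqlite3

# Build a tiny Person/PersonFriend schema with invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Zach'), (4, 'Carol');
INSERT INTO PersonFriend VALUES (1, 2), (2, 1), (2, 3), (3, 4);
""")

def friends_to_depth(name, max_depth):
    # Each level of recursion re-joins PersonFriend against the rows found
    # so far, so the work grows with every additional degree of friendship.
    rows = conn.execute("""
        WITH RECURSIVE reachable(id, depth) AS (
            SELECT pf.FriendID, 1
            FROM PersonFriend pf JOIN Person p ON pf.PersonID = p.ID
            WHERE p.Person = ?
          UNION
            SELECT pf.FriendID, r.depth + 1
            FROM PersonFriend pf JOIN reachable r ON pf.PersonID = r.id
            WHERE r.depth < ?
        )
        SELECT DISTINCT p.Person FROM reachable r JOIN Person p ON p.ID = r.id
        WHERE p.Person <> ?
    """, (name, max_depth, name)).fetchall()
    return sorted(row[0] for row in rows)

print(friends_to_depth('Alice', 2))  # ['Bob', 'Zach']
```

Bounding the recursion by depth keeps the query finite, but each extra level multiplies the number of joined rows, which is exactly the deterioration described in the text.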

We work against the grain whenever we try to model and query connectedness in a relational database. Besides the query and computational complexity just outlined, we also have to deal with the double-edged sword of schema. More often than not, schema proves to be both too rigid and too brittle. To subvert its rigidity we create sparsely populated tables with many nullable columns, and code to handle the exceptional cases — all because there’s no real one-size-fits-all schema to accommodate the variety in the data we encounter. This increases coupling and all but destroys any semblance of cohesion. Its brittleness manifests itself as the extra effort and care required to migrate from one schema to another as an application evolves.

NOSQL Databases Also Lack Relationships

Most NOSQL databases — whether key-value-, document-, or column-oriented — store sets of disconnected documents/values/columns. This makes it difficult to use them for connected data and graphs.

One well-known strategy for adding relationships to such stores is to embed an aggregate’s identifier inside the field belonging to another aggregate — effectively introducing foreign keys. But this requires joining aggregates at the application level, which quickly becomes prohibitively expensive.
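The following toy sketch shows what such an application-level join looks like, using plain dicts to stand in for aggregates in a document store. The record shapes and key names are hypothetical, not any particular product's API:

```python
# Two "aggregates": the user embeds an order identifier as a pseudo foreign key.
users = {"user:Alice": {"name": "Alice", "orders": ["order:1234"]}}
orders = {"order:1234": {"items": ["strawberry ice cream"], "total": 4.99}}

def orders_for(user_key):
    # The "join" happens here, in application code: the database itself
    # sees only opaque strings, not relationships.
    return [orders[oid] for oid in users[user_key]["orders"]]

print(orders_for("user:Alice")[0]["total"])  # 4.99
```

Every hop of this kind is a separate lookup orchestrated by the application, which is why chaining several of them together quickly becomes expensive.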

When we look at an aggregate store model, such as the one in Figure 2-3, we imagine we can see relationships. Seeing a reference to order: 1234 in the record beginning user: Alice, we infer a connection between user: Alice and order: 1234. This gives us false hope that we can use keys and values to manage graphs.

Figure 2-3. Reifying relationships in an aggregate store

图 2-3中,我们推断出一些属性值实际上是对数据库中其他地方的外部聚合的引用。但将这些推断转化为可导航的结构并非易事,因为聚合之间的关系不是数据模型中的一等公民——大多数聚合存储仅以嵌套映射的形式为聚合内部提供结构。相反,使用数据库的应用程序必须从这些扁平、断开的数据结构中构建关系。我们还必须确保应用程序与其余数据一起更新或删除这些外部聚合引用。如果不这样做,存储将积累悬空引用,这可能会损害数据质量和查询性能。

In Figure 2-3 we infer that some property values are really references to foreign aggregates elsewhere in the database. But turning these inferences into a navigable structure doesn’t come for free, because relationships between aggregates aren’t first-class citizens in the data model — most aggregate stores furnish only the insides of aggregates with structure, in the form of nested maps. Instead, the application that uses the database must build relationships from these flat, disconnected data structures. We also have to ensure that the application updates or deletes these foreign aggregate references in tandem with the rest of the data. If this doesn’t happen, the store will accumulate dangling references, which can harm data quality and query performance.

There’s another weak point in this scheme. Because there are no identifiers that “point” backward (the foreign aggregate “links” are not reflexive, of course), we lose the ability to run other interesting queries on the database. For example, with the structure shown in Figure 2-3, asking the database who has bought a particular product — perhaps for the purpose of making a recommendation based on a customer profile — is an expensive operation. If we want to answer this kind of question, we will likely end up exporting the dataset and processing it via some external compute infrastructure, such as Hadoop, to brute-force compute the result. Alternatively, we can retrospectively insert backward-pointing foreign aggregate references, and then query for the result. Either way, the results will be latent.

It’s tempting to think that aggregate stores are functionally equivalent to graph databases with respect to connected data. But this is not the case. Aggregate stores do not maintain consistency of connected data, nor do they support what is known as index-free adjacency, whereby elements contain direct links to their neighbors. As a result, for connected data problems, aggregate stores must employ inherently latent methods for creating and querying relationships outside the data model.
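As an illustration only, this toy sketch shows what index-free adjacency means in practice: each node object holds direct references to its neighbors, so a hop is a pointer dereference rather than an index lookup. The Node class here is our own invention, not any database's API:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.out = {}  # relationship type -> list of neighbor Nodes

    def connect(self, rel, other):
        self.out.setdefault(rel, []).append(other)

alice, bob, zach = Node("Alice"), Node("Bob"), Node("Zach")
bob.connect("FRIEND", alice)
bob.connect("FRIEND", zach)

# Traversal is pointer-chasing: no global index is consulted per hop.
print([n.name for n in bob.out["FRIEND"]])  # ['Alice', 'Zach']
```

The cost of each hop is constant and independent of the total size of the graph, which is the property aggregate stores lack.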

Let’s see how some of these limitations manifest themselves. Figure 2-4 shows a small social network as implemented using documents in an aggregate store.

Figure 2-4. A small social network encoded in an aggregate store

With this structure, it’s easy to find a user’s immediate friends — assuming, of course, the application has been diligent in ensuring identifiers stored in the friends property are consistent with other record IDs in the database. In this case we simply look up immediate friends by their ID, which requires numerous index lookups (one for each friend) but no brute-force scans of the entire dataset. Doing this, we’d find, for example, that Bob considers Alice and Zach to be friends.

But friendship isn’t always symmetric. What if we’d like to ask “who is friends with Bob?” rather than “who are Bob’s friends?” That’s a more difficult question to answer, and in this case our only option would be to brute-force scan across the whole dataset looking for friends entries that contain Bob.
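A sketch of that reverse query against documents like those in Figure 2-4 makes the asymmetry plain. With only outgoing friends arrays, "who is friends with Bob?" forces a scan of every document (the field names here are hypothetical):

```python
# Documents with outgoing "friends" arrays only, as in Figure 2-4.
people = [
    {"id": "Alice", "friends": ["Bob"]},
    {"id": "Bob", "friends": ["Alice", "Zach"]},
    {"id": "Zach", "friends": []},
]

def friends_with(target):
    # O(n) over the whole collection: every record must be inspected,
    # because no structure points backward to the target.
    return [p["id"] for p in people if target in p["friends"]]

print(friends_with("Bob"))  # ['Alice']
```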

To avoid having to process the entire dataset, we could denormalize the storage model by adding backward links. Adding a second property, called perhaps friended_by, to each user, we can list the incoming friendship relations associated with that user. But this doesn’t come for free. For starters, we have to pay the initial and ongoing cost of increased write latency, plus the increased disk utilization cost for storing the additional metadata. On top of that, traversing the links remains expensive, because each hop requires an index lookup. This is because aggregates have no notion of locality, unlike graph databases, which naturally provide index-free adjacency through real — not reified — relationships. By implementing a graph structure atop a nonnative store, we get some of the benefits of partial connectedness, but at substantial cost.
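The denormalization just described can be sketched as follows: every new friendship requires two writes, one for the outgoing entry and one for the backward friended_by entry, and it is the application, not the database, that must keep them consistent. The record shape is hypothetical:

```python
people = {
    "Alice": {"friends": [], "friended_by": []},
    "Bob": {"friends": [], "friended_by": []},
}

def befriend(a, b):
    # Two writes per logical relationship. Forgetting the second one
    # leaves a dangling, one-sided reference in the store.
    people[a]["friends"].append(b)
    people[b]["friended_by"].append(a)

befriend("Alice", "Bob")
print(people["Bob"]["friended_by"])  # ['Alice']
```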

This substantial cost is amplified when it comes to traversing deeper than just one hop. Friends are easy enough, but imagine trying to compute — in real time — friends-of-friends, or friends-of-friends-of-friends. That’s impractical with this kind of database because traversing a fake relationship isn’t cheap. This not only limits your chances of expanding your social network, it also reduces profitable recommendations, misses faulty equipment in your data center, and lets fraudulent purchasing activity slip through the net. Many systems try to maintain the appearance of graph-like processing, but inevitably it’s done in batches and doesn’t provide the real-time interaction that users demand.

Graph Databases Embrace Relationships

The previous examples have dealt with implicitly connected data. As users we infer semantic dependencies between entities, but the data models — and the databases themselves — are blind to these connections. To compensate, our applications must create a network out of the flat, disconnected data at hand, and then deal with any slow queries and latent writes across denormalized stores that arise.

What we really want is a cohesive picture of the whole, including the connections between elements. In contrast to the stores we’ve just looked at, in the graph world, connected data is stored as connected data. Where there are connections in the domain, there are connections in the data. For example, consider the social network shown in Figure 2-5.

Figure 2-5. Easily modeling friends, colleagues, workers, and (unrequited) lovers in a graph

In this social network, as in so many real-world cases of connected data, the connections between entities don’t exhibit uniformity across the domain — the domain is variably-structured. A social network is a popular example of a densely connected, variably-structured network, one that resists being captured by a one-size-fits-all schema or conveniently split across disconnected aggregates. Our simple network of friends has grown in size (there are now potential friends up to six degrees away) and expressive richness. The flexibility of the graph model has allowed us to add new nodes and new relationships without compromising the existing network or migrating data — the original data and its intent remain intact.

The graph offers a much richer picture of the network. We can see who LOVES whom (and whether that love is requited). We can see who is a COLLEAGUE_OF whom, and who is BOSS_OF them all. We can see who’s off the market, because they’re MARRIED_TO someone else; we can even see the antisocial elements in our otherwise social network, as represented by DISLIKES relationships. With this graph at our disposal, we can now look at the performance advantages of graph databases when dealing with connected data.

Relationships in a graph naturally form paths. Querying — or traversing — the graph involves following paths. Because of the fundamentally path-oriented nature of the data model, the majority of path-based graph database operations are highly aligned with the way in which the data is laid out, making them extremely efficient. In their book Neo4j in Action, Partner and Vukotic perform an experiment using both a relational store and Neo4j. The comparison shows that the graph database (in this case, Neo4j and its Traversal Framework) is substantially quicker for connected data than a relational store.

Partner and Vukotic’s experiment seeks to find friends-of-friends in a social network, to a maximum depth of five. For a social network containing 1,000,000 people, each with approximately 50 friends, the results strongly suggest that graph databases are the best choice for connected data, as we see in Table 2-1.

Table 2-1. Finding extended friends in a relational database versus efficient finding in Neo4j

Depth   RDBMS execution time(s)   Neo4j execution time(s)   Records returned
2       0.016                     0.01                      ~2500
3       30.267                    0.168                     ~110,000
4       1543.505                  1.359                     ~600,000
5       Unfinished                2.132                     ~800,000

At depth two (friends-of-friends), both the relational database and the graph database perform well enough for us to consider using them in an online system. Although the Neo4j query runs in two-thirds the time of the relational one, an end user would barely notice the difference in milliseconds between the two. By the time we reach depth three (friend-of-friend-of-friend), however, it’s clear that the relational database can no longer deal with the query in a reasonable time frame: the 30 seconds it takes to complete would be completely unacceptable for an online system. In contrast, Neo4j’s response time remains relatively flat: just a fraction of a second to perform the query — definitely quick enough for an online system.

At depth four the relational database exhibits crippling latency, making it practically useless for an online system. Neo4j’s timings have deteriorated a little too, but the latency here is at the periphery of being acceptable for a responsive online system. Finally, at depth five, the relational database simply takes too long to complete the query. Neo4j, in contrast, returns a result in around two seconds. At depth five, it turns out that almost the entire network is our friend. Because of this, for many real-world use cases we’d likely trim the results, thereby reducing the timings.


Note

Both aggregate stores and relational databases perform poorly when we move away from modestly sized set operations — operations that they should both be good at. Things slow down when we try to mine path information from the graph, as with the friends-of-friends example. We don’t mean to unduly beat up on either aggregate stores or relational databases. They have a fine technology pedigree for the things they’re good at, but they fall short when managing connected data. Anything more than a shallow traversal of immediate friends, or possibly friends-of-friends, will be slow because of the number of index lookups involved. Graphs, on the other hand, use index-free adjacency to ensure that traversing connected data is extremely rapid.


The social network example helps illustrate how different technologies deal with connected data, but is it a valid use case? Do we really need to find such remote “friends”? Perhaps not. But substitute any other domain for the social network, and you’ll see we experience similar performance, modeling, and maintenance benefits. Whether music or data center management, bio-informatics or football statistics, network sensors or time-series of trades, graphs provide powerful insight into our data. Let’s look, then, at another contemporary application of graphs: recommending products based on a user’s purchase history and the histories of his friends, neighbors, and other people like him. With this example, we’ll bring together several independent facets of a user’s lifestyle to make accurate and profitable recommendations.

We’ll start by modeling the purchase history of a user as connected data. In a graph, this is as simple as linking the user to her orders, and linking orders together to provide a purchase history, as shown in Figure 2-6.

The graph shown in Figure 2-6 provides a great deal of insight into customer behavior. We can see all the orders a user has PLACED, and we can easily reason about what each order CONTAINS. To this core domain data structure we’ve then added support for several well-known access patterns. For example, users often want to see their order history, so we’ve added a linked list structure to the graph that allows us to find a user’s most recent order by following an outgoing MOST_RECENT relationship. We can then iterate through the list, going further back in time, by following each PREVIOUS relationship. If we want to move forward in time, we can follow each PREVIOUS relationship in the opposite direction, or add a reciprocal NEXT relationship.
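The linked-list access pattern just described can be sketched in a few lines. The MOST_RECENT and PREVIOUS relationship names come from the text; the Order shape itself is a hypothetical stand-in for graph nodes:

```python
class Order:
    def __init__(self, order_id, previous=None):
        self.order_id = order_id
        self.previous = previous  # the PREVIOUS relationship

# The MOST_RECENT relationship points at the head of the list.
most_recent = Order("1236", previous=Order("1235", previous=Order("1234")))

def order_history(most_recent_order):
    # Follow MOST_RECENT once, then each PREVIOUS relationship in turn,
    # walking backward in time through the user's orders.
    order, history = most_recent_order, []
    while order is not None:
        history.append(order.order_id)
        order = order.previous
    return history

print(order_history(most_recent))  # ['1236', '1235', '1234']
```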

Now we can start to make recommendations. If we notice that many users who buy strawberry ice cream also buy espresso beans, we can start to recommend those beans to users who normally only buy the ice cream. But this is a rather one-dimensional recommendation: we can do much better. To increase our graph’s power, we can join it to graphs from other domains. Because graphs are naturally multidimensional structures, it’s then quite straightforward to ask more sophisticated questions of the data to gain access to a fine-tuned market segment. For example, we can ask the graph to find for us “all the flavors of ice cream liked by people who enjoy espresso but dislike Brussels sprouts, and who live in a particular neighborhood.”
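To show what that multidimensional question involves when the data is not a graph, here is a sketch over toy in-memory records (all names and field shapes are invented). In a graph database this would be a single pattern-matching query; here we have to compose the filters by hand:

```python
people = [
    {"name": "Alice", "likes": {"espresso", "strawberry ice cream"},
     "postcode": "SW11 1BD"},
    {"name": "Zach", "likes": {"espresso", "Brussels sprouts"},
     "postcode": "SW11 1BD"},
]

def flavors_for_segment(postcode):
    # Hand-rolled equivalent of "ice cream flavors liked by people who
    # enjoy espresso but dislike Brussels sprouts, in this neighborhood".
    flavors = set()
    for p in people:
        if (p["postcode"] == postcode
                and "espresso" in p["likes"]
                and "Brussels sprouts" not in p["likes"]):
            flavors |= {like for like in p["likes"] if "ice cream" in like}
    return flavors

print(flavors_for_segment("SW11 1BD"))  # {'strawberry ice cream'}
```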

Figure 2-6. Modeling a user's order history as a graph

For the purpose of our interpretation of the data, we can consider the degree to which someone repeatedly buys a product to be indicative of whether or not they like that product. But how might we define “live in a neighborhood”? Well, it turns out that geospatial coordinates are very conveniently modeled as graphs. One of the most popular structures for representing geospatial coordinates is called an R-Tree. An R-Tree is a graph-like index that describes bounded boxes around geographies. Using such a structure we can describe overlapping hierarchies of locations. For example, we can represent the fact that London is in the UK, and that the postal code SW11 1BD is in Battersea, which is a district in London, which is in southeastern England, which, in turn, is in Great Britain. And because UK postal codes are fine-grained, we can use that boundary to target people with somewhat similar tastes.1


Note

Such pattern-matching queries are extremely difficult to write in SQL, and laborious to write against aggregate stores, and in both cases they tend to perform very poorly. Graph databases, on the other hand, are optimized for precisely these types of traversals and pattern-matching queries, providing in many cases millisecond responses. Moreover, most graph databases provide a query language suited to expressing graph constructs and graph queries. In the next chapter, we’ll look at Cypher, which is a pattern-matching language tuned to the way we tend to describe graphs using diagrams.


We can use our example graph to make recommendations to users, but we can also use it to benefit the seller. For example, given certain buying patterns (products, cost of typical order, and so on), we can establish whether a particular transaction is potentially fraudulent. Patterns outside of the norm for a given user can easily be detected in a graph and then flagged for further attention (using well-known similarity measures from the graph data-mining literature), thus reducing the risk for the seller.2
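
As a sketch of how such an outlier check might look in Cypher, the following query flags orders whose value is far above a user's average. The User and Order labels, the PLACED relationship, the value property, the name 'Alice', and the factor of 3 are all illustrative assumptions:

MATCH (u:User {name:'Alice'})-[:PLACED]->(o:Order)
WITH u, avg(o.value) AS avgOrderValue
MATCH (u)-[:PLACED]->(suspect:Order)
WHERE suspect.value > 3 * avgOrderValue
RETURN suspect

In practice, a production fraud check would combine several such signals rather than relying on a single threshold.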

From the data practitioner’s point of view, it’s clear that the graph database is the best technology for dealing with complex, variably structured, densely connected data — that is, with datasets so sophisticated they are unwieldy when treated in any form other than a graph.

Summary

In this chapter we’ve seen how connectedness in relational databases and NOSQL data stores requires developers to implement data processing in the application layer, and contrasted that with graph databases, where connectedness is a first-class citizen. In the next chapter, we look in more detail at the topic of graph modeling.

1 The Neo4j-spatial library conveniently takes care of n-dimensional polygon indexes for us. See https://github.com/neo4j-contrib/spatial.

2 For an overview of similarity measures, see Klein, D.J. May 2010. “Centrality measure in graphs.” Journal of Mathematical Chemistry 47(4): 1209-1223.

Chapter 3. Data Modeling with Graphs

In previous chapters we’ve described the substantial benefits of the graph database when compared both with other NOSQL stores and with traditional relational databases. But having chosen to adopt a graph database, the question arises: how do we model in graphs?

This chapter focuses on graph modeling. Starting with a recap of the labeled property graph model — the most widely adopted graph data model — we then provide an overview of the graph query language used for most of the code examples in this book: Cypher. Though there are several graph query languages in existence, Cypher is the most widely deployed, making it the de facto standard. It is also easy to learn and understand, especially for those of us coming from a SQL background. With these fundamentals in place, we dive straight into some examples of graph modeling. With our first example, based on a systems management domain, we compare relational and graph modeling techniques. In the second example, the production and consumption of Shakespearean literature, we use a graph to connect and query several disparate domains. We end the chapter by looking at some common pitfalls when modeling with graphs, and highlight some good practices.

Models and Goals

Before we dig deeper into modeling with graphs, a word on models in general. Modeling is an abstracting activity motivated by a particular need or goal. We model in order to bring specific facets of an unruly domain into a space where they can be structured and manipulated. There are no natural representations of the world the way it “really is,” just many purposeful selections, abstractions, and simplifications, some of which are more useful than others for satisfying a particular goal.

Graph representations are no different in this respect. What perhaps differentiates them from many other data modeling techniques, however, is the close affinity between the logical and physical models. Relational data management techniques require us to deviate from our natural language representation of the domain: first by cajoling our representation into a logical model, and then by forcing it into a physical model. These transformations introduce semantic dissonance between our conceptualization of the world and the database’s instantiation of that model. With graph databases, this gap shrinks considerably.

The Labeled Property Graph Model

We introduced the labeled property graph model in Chapter 1. To recap, these are its salient features:

  • A labeled property graph is made up of nodes, relationships, properties, and labels.
  • Nodes contain properties. Think of nodes as documents that store properties in the form of arbitrary key-value pairs. In Neo4j, the keys are strings and the values are the Java string and primitive data types, plus arrays of these types.
  • Nodes can be tagged with one or more labels. Labels group nodes together, and indicate the roles they play within the dataset.
  • Relationships connect nodes and structure the graph. A relationship always has a direction, a single name, and a start node and an end node — there are no dangling relationships. Together, a relationship’s direction and name add semantic clarity to the structuring of nodes.
  • Like nodes, relationships can also have properties. The ability to add properties to relationships is particularly useful for providing additional metadata for graph algorithms, adding additional semantics to relationships (including quality and weight), and for constraining queries at runtime.
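
All of these primitives can appear together in a single statement. The following Cypher sketch (Cypher itself is introduced in the next section) creates two labeled nodes with properties and connects them with a named, directed relationship that carries its own property; the since value is an assumption for illustration:

CREATE (jim:Person {name:'Jim'})
CREATE (ian:Person {name:'Ian'})
CREATE (jim)-[:KNOWS {since:2011}]->(ian)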

These simple primitives are all we need to create sophisticated and semantically rich models. So far, all our models have been in the form of diagrams. Diagrams are great for describing graphs outside of any technology context, but when it comes to using a database, we need some other mechanism for creating, manipulating, and querying data. We need a query language.

Querying Graphs: An Introduction to Cypher

Cypher is an expressive (yet compact) graph database query language. Although currently specific to Neo4j, its close affinity with our habit of representing graphs as diagrams makes it ideal for programmatically describing graphs. For this reason, we use Cypher throughout the rest of this book to illustrate graph queries and graph constructions. Cypher is arguably the easiest graph query language to learn, and is a great basis for learning about graphs. Once you understand Cypher, it becomes very easy to branch out and learn other graph query languages.

In the following sections we’ll take a brief tour through Cypher. This isn’t a reference document for Cypher, however — merely a friendly introduction so that we can explore more interesting graph query scenarios later on.1

Cypher Philosophy

Cypher is designed to be easily read and understood by developers, database professionals, and business stakeholders. Its ease of use derives from the fact that it is in accord with the way we intuitively describe graphs using diagrams.

Cypher enables a user (or an application acting on behalf of a user) to ask the database to find data that matches a specific pattern. Colloquially, we ask the database to “find things like this.” And the way we describe what “things like this” look like is to draw them, using ASCII art. Figure 3-1 shows an example of a simple pattern.

Figure 3-1. A simple graph pattern, expressed using a diagram

This pattern describes three mutual friends. Here’s the equivalent ASCII art representation in Cypher:

(emil)<-[:KNOWS]-(jim)-[:KNOWS]->(ian)-[:KNOWS]->(emil)

This pattern describes a path that connects a node we’ll call jim to two nodes we’ll call ian and emil, and which also connects the ian node to the emil node. ian, jim, and emil are identifiers. Identifiers allow us to refer to the same node more than once when describing a pattern — a trick that helps us get round the fact that a query language has only one dimension (text proceeding from left to right), whereas a graph diagram can be laid out in two dimensions. Despite having occasionally to repeat identifiers in this way, the intent remains clear. Cypher patterns follow very naturally from the way we draw graphs on the whiteboard.


Note

The previous Cypher pattern describes a simple graph structure, but it doesn’t yet refer to any particular data in the database. To bind the pattern to specific nodes and relationships in an existing dataset we must specify some property values and node labels that help locate the relevant elements in the dataset. For example:

(emil:Person {name:'Emil'})
  <-[:KNOWS]-(jim:Person {name:'Jim'})
  -[:KNOWS]->(ian:Person {name:'Ian'})
  -[:KNOWS]->(emil)

Here we’ve bound each node to its identifier using its name property and Person label. The emil identifier, for example, is bound to a node in the dataset with a label Person and a name property whose value is Emil. Anchoring parts of the pattern to real data in this way is normal Cypher practice, as we shall see in the following sections.


ASCII art graph patterns are fundamental to Cypher. A Cypher query anchors one or more parts of a pattern to specific locations in a graph using predicates, and then flexes the unanchored parts around to find local matches.


Note

The anchor points in the real graph, to which some parts of the pattern are bound, are determined by Cypher based on the labels and property predicates in the query. In most cases, Cypher uses metainformation about existing indexes, constraints, and predicates to figure things out automatically. Occasionally, however, it helps to specify some additional hints.


Like most query languages, Cypher is composed of clauses. The simplest queries consist of a MATCH clause followed by a RETURN clause (we’ll describe the other clauses you can use in a Cypher query later in this chapter). Here’s an example of a Cypher query that uses these two clauses to find the mutual friends of a user named Jim:

MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
      (a)-[:KNOWS]->(c)
RETURN b, c

Let’s look at each clause in more detail.

MATCH

The MATCH clause is at the heart of most Cypher queries. This is the specification by example part. Using ASCII characters to represent nodes and relationships, we draw the data we’re interested in. We draw nodes with parentheses, and relationships using pairs of dashes with greater-than or less-than signs (--> and <--). The < and > signs indicate relationship direction. Between the dashes, set off by square brackets and prefixed by a colon, we put the relationship name. Node labels are similarly prefixed by a colon. Node (and relationship) property key-value pairs are then specified within curly braces (much like a JavaScript object).

In our example query, we’re looking for a node labeled Person with a name property whose value is Jim. The return value from this lookup is bound to the identifier a. This identifier allows us to refer to the node that represents Jim throughout the rest of the query.

This start node is part of a simple pattern (a)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c) that describes a path comprising three nodes, one of which we’ve bound to the identifier a, the others to b and c. These nodes are connected by way of several KNOWS relationships, as per Figure 3-1.

This pattern could, in theory, occur many times throughout our graph data; with a large user set, there may be many mutual relationships corresponding to this pattern. To localize the query, we need to anchor some part of it to one or more places in the graph. In specifying that we’re looking for a node labeled Person whose name property value is Jim, we’ve bound the pattern to a specific node in the graph — the node representing Jim. Cypher then matches the remainder of the pattern to the graph immediately surrounding this anchor point. As it does so, it discovers nodes to bind to the other identifiers. While a will always be anchored to Jim, b and c will be bound to a sequence of nodes as the query executes.

Alternatively, we can express the anchoring as a predicate in the WHERE clause.

MATCH (a:Person)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
WHERE a.name = 'Jim'
RETURN b, c

Here we’ve moved the property lookup from the MATCH clause to the WHERE clause. The outcome is the same as our earlier query.

RETURN

This clause specifies which nodes, relationships, and properties in the matched data should be returned to the client. In our example query, we’re interested in returning the nodes bound to the b and c identifiers. Each matching node is lazily bound to its identifier as the client iterates the results.
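
RETURN can also project individual properties and expressions rather than whole nodes. A small sketch, reusing the social network example (the friend alias is arbitrary):

MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)
RETURN b.name AS friend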

Other Cypher Clauses

The other clauses we can use in a Cypher query include:

WHERE
Provides criteria for filtering pattern matching results.
CREATE and CREATE UNIQUE
Create nodes and relationships.
MERGE
Ensures that the supplied pattern exists in the graph, either by reusing existing nodes and relationships that match the supplied predicates, or by creating new nodes and relationships.
DELETE
Removes nodes, relationships, and properties.
SET
Sets property values.
FOREACH
Performs an updating action for each element in a list.
UNION
Merges results from two or more queries.
WITH
Chains subsequent query parts and forwards results from one to the next. Similar to piping commands in Unix.
START
Specifies one or more explicit starting points — nodes or relationships — in the graph. (START is deprecated in favor of specifying anchor points in a MATCH clause.)
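
To give a feel for how several of these clauses combine, here is a sketch that reuses the social network example; the email property is an assumption for illustration:

MERGE (a:Person {name:'Jim'})
SET a.email = 'jim@example.org'
WITH a
MATCH (a)-[:KNOWS]->(b)
RETURN b.name

MERGE here reuses the Jim node if it already exists and creates it otherwise, before SET updates it and WITH pipes it into a further MATCH.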

If these clauses look familiar — especially if you’re a SQL developer — that’s great! Cypher is intended to be familiar enough to help you move rapidly along the learning curve. At the same time, it’s different enough to emphasize that we’re dealing with graphs, not relational sets.

We’ll see some examples of these clauses later in the chapter. Where they occur, we’ll describe in more detail how they work.

Now that we’ve seen how we can describe and query a graph using Cypher, we can look at some examples of graph modeling.

A Comparison of Relational and Graph Modeling

To introduce graph modeling, we’re going to look at how we model a domain using both relational- and graph-based techniques. Most developers and data professionals are familiar with RDBMS (relational database management systems) and the associated data modeling techniques; as a result, the comparison will highlight a few similarities, and many differences. In particular, we’ll see how easy it is to move from a conceptual graph model to a physical graph model, and how little the graph model distorts what we’re trying to represent versus the relational model.

To facilitate this comparison, we’ll examine a simple data center management domain. In this domain, several data centers support many applications on behalf of many customers using different pieces of infrastructure, from virtual machines to physical load balancers. An example of this domain is shown in Figure 3-2.

图 3-2中,我们看到了几个应用程序和支持它们所需数据中心基础架构的简化视图。应用程序以节点App 1App 2和表示App 3,依赖于标记为 的数据库集群Database Server 1, 2, 3。虽然用户在逻辑上依赖于应用程序及其数据的可用性,但用户和应用程序之间还有额外的物理基础设施;此基础设施包括虚拟机(Virtual Machine 10, 11, 20, 30, 31)、真实服务器(Server 1, 2, 3)、服务器机架(Rack 1, 2)和负载均衡器(Load Balancer 1, 2),它们位于应用程序前面。当然,在每个组件之间有许多网络元素:电缆、交换机、配线架、NIC(网络接口控制器)、电源、空调等等——所有这些都可能在不方便的时候出现故障。为了使这幅图完整,我们有一个应用程序 3 的稻草人单个用户,用 表示User 3

In Figure 3-2 we see a somewhat simplified view of several applications and the data center infrastructure necessary to support them. The applications, represented by nodes App 1, App 2, and App 3, depend on a cluster of databases labeled Database Server 1, 2, 3. While users logically depend on the availability of an application and its data, there is additional physical infrastructure between the users and the application; this infrastructure includes virtual machines (Virtual Machine 10, 11, 20, 30, 31), real servers (Server 1, 2, 3), racks for the servers (Rack 1, 2 ), and load balancers (Load Balancer 1, 2), which front the apps. In between each of the components there are, of course, many networking elements: cables, switches, patch panels, NICs (network interface controllers), power supplies, air conditioning, and so on — all of which can fail at inconvenient times. To complete the picture we have a straw-man single user of application 3, represented by User 3.

As the operators of such a system, we have two primary concerns:

  • Ongoing provision of functionality to meet (or exceed) a service-level agreement, including the ability to perform forward-looking analyses to determine single points of failure, and retrospective analyses to rapidly determine the cause of any customer complaints regarding the availability of service.
  • Billing for resources consumed, including the cost of hardware, virtualization, network provisioning, and even the costs of software development and operations (since these are simply logical extensions of the system we see here).
Figure 3-2. A simplified snapshot of application deployment within a data center

If we are building a data center management solution, we’ll want to ensure that the underlying data model allows us to store and query data in a way that efficiently addresses these primary concerns. We’ll also want to be able to update the underlying model as the application portfolio changes, the physical layout of the data center evolves, and virtual machine instances migrate. Given these needs and constraints, let’s see how the relational and graph models compare.

Relational Modeling in a Systems Management Domain

The initial stage of modeling in the relational world is similar to the first stage of many other data modeling techniques: that is, we seek to understand and agree on the entities in the domain, how they interrelate, and the rules that govern their state transitions. Most of this tends to be done informally, often through whiteboard sketches and discussions between subject matter experts and systems and data architects. To express our common understanding and agreement, we typically create a diagram such as the one in Figure 3-2, which is a graph.

The next stage captures this agreement in a more rigorous form such as an entity-relationship (E-R) diagram — another graph. This transformation of the conceptual model into a logical model using a stricter notation provides us with a second chance to refine our domain vocabulary so that it can be shared with relational database specialists. (Such approaches aren’t always necessary: adept relational users often move directly to table design and normalization without first describing an intermediate E-R diagram.) In our example, we’ve captured the domain in the E-R diagram shown in Figure 3-3.


Note

Despite being graphs, E-R diagrams immediately demonstrate the shortcomings of the relational model for capturing a rich domain. Although they allow relationships to be named (something that graph databases fully embrace, but which relational stores do not), E-R diagrams allow only single, undirected, named relationships between entities. In this respect, the relational model is a poor fit for real-world domains where relationships between entities are both numerous and semantically rich and diverse.


Having arrived at a suitable logical model, we map it into tables and relations, which are normalized to eliminate data redundancy. In many cases this step can be as simple as transcribing the E-R diagram into a tabular form and then loading those tables via SQL commands into the database. But even the simplest case serves to highlight the idiosyncrasies of the relational model. For example, in Figure 3-4 we see that a great deal of accidental complexity has crept into the model in the form of foreign key constraints (everything annotated [FK]), which support one-to-many relationships, and join tables (e.g., AppDatabase), which support many-to-many relationships — and all this before we’ve added a single row of real user data. These constraints are model-level metadata that exist simply so that we can make concrete the relations between tables at query time. Yet the presence of this structural data is keenly felt, because it clutters and obscures the domain data with data that serves the database, not the user.

Figure 3-3. An entity-relationship diagram for the data center domain

We now have a normalized model that is relatively faithful to the domain. This model, though imbued with substantial accidental complexity in the form of foreign keys and join tables, contains no duplicate data. But our design work is not yet complete. One of the challenges of the relational paradigm is that normalized models generally aren’t fast enough for real-world needs. For many production systems, a normalized schema, which in theory is fit for answering any kind of ad hoc question we may wish to pose to the domain, must in practice be further adapted and specialized for specific access patterns. In other words, to make relational stores perform well enough for regular application needs, we have to abandon any vestiges of true domain affinity and accept that we have to change the user’s data model to suit the database engine, not the user. This technique is called denormalization.

Denormalization involves duplicating data (substantially in some cases) in order to gain query performance. Take as an example users and their contact details. A typical user often has several email addresses, which, in a fully normalized model, we would store in a separate EMAIL table. To reduce joins and the performance penalty imposed by joining between two tables, however, it is quite common to inline this data in the USER table, adding one or more columns to store a user’s most important email addresses.

Figure 3-4. Tables and relationships for the data center domain

Although denormalization may be a safe thing to do (assuming developers understand the denormalized model and how it maps to their domain-centric code, and have robust transactional support from the database), it is usually not a trivial task. For the best results, we usually turn to a true RDBMS expert to munge our normalized model into a denormalized one aligned with the characteristics of the underlying RDBMS and physical storage tier. In doing this, we accept that there may be substantial data redundancy.

We might be tempted to think that all this design-normalize-denormalize effort is acceptable because it is a one-off task. This school of thought suggests that the cost of the work is amortized across the entire lifetime of the system (which includes both development and production) such that the effort of producing a performant relational model is comparatively small compared to the overall cost of the project. This is an appealing notion, but in many cases it doesn’t match reality, because systems change not only during development, but also during their production lifetimes.

The amortized view of data model change, in which costly changes during development are eclipsed by the long-term benefits of a stable model in production, assumes that systems spend the majority of their time in production environments, and that these production environments are stable. Though it may be the case that most systems spend most of their time in production environments, these environments are rarely stable. As business requirements change or regulatory requirements evolve, so must our systems and the data structures on which they are built.

Data models invariably undergo substantial revision during the design and development phases of a project, and in almost every case, these revisions are intended to accommodate the model to the needs of the applications that will consume it once it is in production. These initial design influences are so powerful that it becomes nearly impossible to modify the application and the model once they’re in production to accommodate things they were not originally designed to do.

The technical mechanism by which we introduce structural change into a database is called migration, as popularized by application development frameworks such as Rails. Migrations provide a structured, step-wise approach to applying a set of database refactorings to a database so that it can be responsibly evolved to meet the changing needs of the applications that use it. Unlike code refactorings, however, which we typically accomplish in a matter of seconds or minutes, database refactorings can take weeks or months to complete, with downtime for schema changes. Database refactoring is slow, risky, and expensive.

The problem, then, with the denormalized model is its resistance to the kind of rapid evolution the business demands of its systems. As we’ve seen with the data center example, the changes imposed on the whiteboard model over the course of implementing a relational solution create a gulf between the conceptual world and the way the data is physically laid out; this conceptual-relational dissonance all but prevents business stakeholders from actively collaborating in the further evolution of a system. Stakeholder participation stops at the threshold of the relational edifice. On the development side, the difficulties in translating changed business requirements into the underlying and entrenched relational structure cause the evolution of the system to lag behind the evolution of the business. Without expert assistance and rigorous planning, migrating a denormalized database poses several risks. If the migrations fail to maintain storage-affinity, performance can suffer. Just as serious, if deliberately duplicated data is left orphaned after a migration, we risk compromising the integrity of the data as a whole.

Graph Modeling in a Systems Management Domain

We’ve seen how relational modeling and its attendant implementation activities take us down a path that divorces an application’s underlying storage model from the conceptual worldview of its stakeholders. Relational databases — with their rigid schemas and complex modeling characteristics — are not an especially good tool for supporting rapid change. What we need is a model that is closely aligned with the domain, but that doesn’t sacrifice performance, and that supports evolution while maintaining the integrity of the data as it undergoes rapid change and growth. That model is the graph model. How, then, does this process differ when realized with a graph data model?

In the early stages of analysis, the work required of us is similar to the relational approach: using lo-fi methods, such as whiteboard sketches, we describe and agree upon the domain. After that, however, the methodologies diverge. Instead of transforming a domain model’s graph-like representation into tables, we enrich it, with the aim of producing an accurate representation of the parts of the domain relevant to our application goals. That is, for each entity in our domain, we ensure that we’ve captured its relevant roles as labels, its attributes as properties, and its connections to neighboring entities as relationships.


Note

Remember, the domain model is not a transparent, context-free window onto reality: rather, it is a purposeful abstraction of those aspects of our domain relevant to our application goals. There’s always some motivation for creating a model. By enriching our first-cut domain graph with additional properties and relationships, we effectively produce a graph model attuned to our application’s data needs; that is, we provide for answering the kinds of questions our application will ask of its data.


Helpfully, domain modeling is completely isomorphic to graph modeling. By ensuring the correctness of the domain model, we’re implicitly improving the graph model, because in a graph database what you sketch on the whiteboard is typically what you store in the database.

In graph terms, what we’re doing is ensuring that each node has the appropriate role-specific labels and properties so that it can fulfill its dedicated data-centric domain responsibilities. But we’re also ensuring that every node is placed in the correct semantic context; we do this by creating named and directed (and often attributed) relationships between nodes to capture the structural aspects of the domain. For our data center scenario, the resulting graph model looks like Figure 3-5.

Figure 3-5. Example graph of the data center deployment scenario

And logically, that’s all we need to do. No tables, no normalization, no denormalization. Once we have an accurate representation of our domain model, moving it into the database is trivial, as we shall see shortly.


Note

Note that most of the nodes here have two labels: both a specific type label (such as Database, App, or Server), and a more general-purpose Asset label. This allows us to target particular types of asset with some of our queries, and all assets, irrespective of type, with other queries.
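
This dual labeling shows up directly in our queries. The following is a sketch only; the property names here are assumptions borrowed from the fault-finding query shown later in this chapter:

```cypher
// Target one specific type of asset
MATCH (server:Server {status:'down'})
RETURN server

// Target all assets that are down, irrespective of type
MATCH (asset:Asset {status:'down'})
RETURN asset
```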


Testing the Model

Once we’ve refined our domain model, the next step is to test how suitable it is for answering realistic queries. Although graphs are great for supporting a continuously evolving structure (and therefore for correcting any erroneous earlier design decisions), there are a number of design decisions that, once they are baked into our application, can hamper us further down the line. By reviewing the domain model and the resulting graph model at this early stage, we can avoid these pitfalls. Subsequent changes to the graph structure will then be driven solely by changes in the business, rather than by the need to mitigate poor design decisions.

In practice there are two techniques we can apply here. The first, and simplest, is just to check that the graph reads well. We pick a start node, and then follow relationships to other nodes, reading each node’s labels and each relationship’s name as we go. Doing so should create sensible sentences. For our data center example, we can read off sentences like “The App, which consists of App Instances 1, 2, and 3, uses the Database, which resides on Database Servers 1, 2 and 3,” and “Server 3 runs VM 31, which hosts App Instance 3.” If reading the graph in this way makes sense, we can be reasonably confident it is faithful to the domain.

To further increase our confidence, we also need to consider the queries we’ll run on the graph. Here we adopt a design for queryability mindset. To validate that the graph supports the kinds of queries we expect to run on it, we must describe those queries. This requires us to understand our end users’ goals; that is, the use cases to which the graph is to be applied. In our data center scenario, for example, one of our use cases involves end users reporting that an application or service is unresponsive. To help these users, we must identify the cause of the unresponsiveness and then resolve it. To determine what might have gone wrong we need to identify what’s on the path between the user and the application, and also what the application depends on to deliver functionality to the user. Given a particular graph representation of the data center domain, if we can craft a Cypher query that addresses this use case, we can be even more certain that the graph meets the needs of our domain.

Continuing with our example use case, let’s assume we can update the graph from our regular network monitoring tools, thereby providing us with a near real-time view of the state of the network. With a large physical network, we might use Complex Event Processing (CEP) to process streams of low-level network events, updating the graph only when the CEP solution raises a significant domain event. When a user reports a problem, we can limit the physical fault-finding to problematic network elements between the user and the application and the application and its dependencies. In our graph we can find the faulty equipment with the following query:

MATCH (user:User)-[*1..5]-(asset:Asset)
WHERE user.name = 'User 3' AND asset.status = 'down'
RETURN DISTINCT asset

The MATCH clause here describes a variable length path between one and five relationships long. The relationships are unnamed and undirected (there’s no colon or relationship name between the square brackets, and no arrow-tip to indicate direction). This allows us to match paths such as:

(user)-[:USER_OF]->(app)
(user)-[:USER_OF]->(app)-[:USES]->(database)
(user)-[:USER_OF]->(app)-[:USES]->(database)-[:SLAVE_OF]->(another-database)
(user)-[:USER_OF]->(app)-[:RUNS_ON]->(vm)
(user)-[:USER_OF]->(app)-[:RUNS_ON]->(vm)-[:HOSTED_BY]->(server)
(user)-[:USER_OF]->(app)-[:RUNS_ON]->(vm)-[:HOSTED_BY]->(server)-[:IN]->(rack)
(user)-[:USER_OF]->(app)-[:RUNS_ON]->(vm)-[:HOSTED_BY]->(server)-[:IN]->(rack)
  <-[:IN]-(load-balancer)

That is, starting from the user who reported the problem, we MATCH against all assets in the graph along an undirected path of length 1 to 5. We add asset nodes that have a status property with a value of down to our results. If a node doesn’t have a status property, it won’t be included in the results. RETURN DISTINCT asset ensures that unique assets are returned in the results, no matter how many times they are matched.

Given that such a query is readily supported by our graph, we gain confidence that the design is fit for purpose.

Cross-Domain Models

Business insight often depends on us understanding the hidden network effects at play in a complex value chain. To generate this understanding, we need to join domains together without distorting or sacrificing the details particular to each domain. Property graphs provide a solution here. Using a property graph, we can model a value chain as a graph of graphs in which specific relationships connect and distinguish constituent subdomains.

图 3-6中,我们看到了围绕莎士比亚文学作品生产和消费的价值链的图表。这里我们拥有关于莎士比亚及其部分戏剧的高质量信息,以及最近上演过这些戏剧的一家公司的详细信息,以及一个剧院场地和一些地理空间数据。我们甚至还添加了评论。总之,该图描述并连接了三个不同的领域。在图中,我们用不同格式的关系区分了这三个领域:虚线表示文学领域,实线表示戏剧领域,虚线表示地理空间领域。

In Figure 3-6, we see a graph representation of the value chain surrounding the production and consumption of Shakespearean literature. Here we have high-quality information about Shakespeare and some of his plays, together with details of one of the companies that has recently performed the plays, plus a theatrical venue, and some geospatial data. We’ve even added a review. In all, the graph describes and connects three different domains. In the diagram we’ve distinguished these three domains with differently formatted relationships: dotted for the literary domain, solid for the theatrical domain, and dashed for the geospatial domain.

Looking first at the literary domain, we have a node that represents Shakespeare himself, with a label Author and properties firstname:'William' and lastname:'Shakespeare'. This node is connected to a pair of nodes, each of which is labeled Play, representing the plays Julius Caesar (title:'Julius Caesar') and The Tempest (title:'The Tempest'), via relationships named WROTE_PLAY.

Figure 3-6. Three domains in one graph

Reading this subgraph left-to-right, following the direction of the relationship arrows, tells us that the author William Shakespeare wrote the plays Julius Caesar and The Tempest. If we’re interested in provenance, each WROTE_PLAY relationship has a year property, which tells us that Julius Caesar was written in 1599 and The Tempest in 1610. It’s a trivial matter to see how we could add the rest of Shakespeare’s works — the plays and the poems — into the graph simply by adding more nodes to represent each work, and joining them to the Shakespeare node via WROTE_PLAY and WROTE_POEM relationships.
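
Adding such a work is a matter of one more node and one more relationship. In the following sketch, the Poem label, title, and year are illustrative assumptions of ours, not part of the sample dataset:

```cypher
MATCH (shakespeare:Author {lastname:'Shakespeare'})
CREATE (sonnet:Poem {title:'Sonnet 18'}),
       (shakespeare)-[:WROTE_POEM {year:1609}]->(sonnet)
```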


Note

By tracing the WROTE_PLAY relationship arrows with our finger, we’re effectively doing the kind of work that a graph database performs, albeit at human speed rather than computer speed. As we’ll see later, this simple traversal operation is the building block for arbitrarily sophisticated graph queries.


Turning next to the theatrical domain, we’ve added some information about the Royal Shakespeare Company (often known simply as the RSC) in the form of a node with the label Company and a property key name whose value is RSC. The theatrical domain is, unsurprisingly, connected to the literary domain. In our graph, this is reflected by the fact that the RSC has PRODUCED versions of Julius Caesar and The Tempest. In turn, these theatrical productions are connected to the plays in the literary domain, using PRODUCTION_OF relationships.

The graph also captures details of specific performances. For example, the RSC’s production of Julius Caesar was performed on July 29, 2012 as part of the RSC’s summer touring season. If we’re interested in the performance venue, we simply follow the outgoing VENUE relationship from the performance node to find that the play was performed at the Theatre Royal, represented by a node labeled Venue.
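
Following that VENUE relationship in Cypher is a one-hop pattern. A sketch against the sample graph:

```cypher
// Find the venue for the July 29, 2012 performance
MATCH (performance:Performance {date:20120729})-[:VENUE]->(venue:Venue)
RETURN venue.name
```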

The graph also allows us to capture reviews of specific performances. In our sample graph we’ve included just one review, for the July 29 performance, written by the user Billy. We can see this in the interplay of the performance, rating, and user nodes. In this case we have a node labeled User representing Billy (with property name:'Billy') whose outgoing WROTE_REVIEW relationship connects to a node representing his review. The Review node contains a numeric rating property and a free-text review property. The review is linked to a specific Performance through an outgoing REVIEW_OF relationship. To scale this up to many users, many reviews, and many performances, we simply add more nodes with the appropriate labels and more identically named relationships to the graph.

The third domain, that of geospatial data, comprises a simple hierarchical tree of places. This geospatial domain is connected to the other two domains at several points in the graph. The City of Stratford upon Avon (with property name:'Stratford upon Avon') is connected to the literary domain as a result of its being Shakespeare’s birthplace (Shakespeare was BORN_IN Stratford). It is connected to the theatrical domain insofar as it is home to the RSC (the RSC is BASED_IN Stratford). To learn more about Stratford upon Avon’s geography, we can follow its outgoing COUNTRY relationship to discover it is in the Country named England.


Note

Note how the graph reduces instances of duplicated data across domains. Stratford upon Avon, for example, participates in all three domains.


The graph makes it possible to capture more complex geospatial data. Looking at the labels on the nodes to which the Theatre Royal is connected, for example, we see that it is located on Grey Street, which is in the City of Newcastle, which is in the County of Tyne and Wear, which ultimately is in the Country of England — just like Stratford upon Avon.
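
Because the address hierarchy is nothing more than a chain of relationships, we can walk all of it with a single variable-length pattern. A sketch (relationship types can be alternated with the | symbol):

```cypher
// Walk from the venue up through street, city, county, and country
MATCH (theatreRoyal:Venue {name:'Theatre Royal'})
      -[:STREET|CITY|COUNTY|COUNTRY*1..4]->(place)
RETURN labels(place) AS kind, place.name AS name
```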

Creating the Shakespeare Graph

To create the Shakespeare graph shown in Figure 3-6, we use CREATE to build the overall structure. This statement is executed by the Cypher runtime within a single transaction such that once the statement has executed, we can be confident the graph is present in its entirety in the database. If the transaction fails, no part of the graph will be present in the database. As we might expect, Cypher has a humane and visual way of building graphs:

CREATE (shakespeare:Author {firstname:'William', lastname:'Shakespeare'}),
       (juliusCaesar:Play {title:'Julius Caesar'}),
       (shakespeare)-[:WROTE_PLAY {year:1599}]->(juliusCaesar),
       (theTempest:Play {title:'The Tempest'}),
       (shakespeare)-[:WROTE_PLAY {year:1610}]->(theTempest),
       (rsc:Company {name:'RSC'}),
       (production1:Production {name:'Julius Caesar'}),
       (rsc)-[:PRODUCED]->(production1),
       (production1)-[:PRODUCTION_OF]->(juliusCaesar),
       (performance1:Performance {date:20120729}),
       (performance1)-[:PERFORMANCE_OF]->(production1),
       (production2:Production {name:'The Tempest'}),
       (rsc)-[:PRODUCED]->(production2),
       (production2)-[:PRODUCTION_OF]->(theTempest),
       (performance2:Performance {date:20061121}),
       (performance2)-[:PERFORMANCE_OF]->(production2),
       (performance3:Performance {date:20120730}),
       (performance3)-[:PERFORMANCE_OF]->(production1),
       (billy:User {name:'Billy'}),
       (review:Review {rating:5, review:'This was awesome!'}),
       (billy)-[:WROTE_REVIEW]->(review),
       (review)-[:RATED]->(performance1),
       (theatreRoyal:Venue {name:'Theatre Royal'}),
       (performance1)-[:VENUE]->(theatreRoyal),
       (performance2)-[:VENUE]->(theatreRoyal),
       (performance3)-[:VENUE]->(theatreRoyal),
       (greyStreet:Street {name:'Grey Street'}),
       (theatreRoyal)-[:STREET]->(greyStreet),
       (newcastle:City {name:'Newcastle'}),
       (greyStreet)-[:CITY]->(newcastle),
       (tyneAndWear:County {name:'Tyne and Wear'}),
       (newcastle)-[:COUNTY]->(tyneAndWear),
       (england:Country {name:'England'}),
       (tyneAndWear)-[:COUNTRY]->(england),
       (stratford:City {name:'Stratford upon Avon'}),
       (stratford)-[:COUNTRY]->(england),
       (rsc)-[:BASED_IN]->(stratford),
       (shakespeare)-[:BORN_IN]->(stratford)

The preceding Cypher code does two different things. It creates labeled nodes (and their properties), and then connects them with relationships (and, where necessary, relationship properties). For example, CREATE (shakespeare:Author {firstname:'William', lastname:'Shakespeare'}) creates an Author node representing William Shakespeare. The newly created node is assigned to the identifier shakespeare. This shakespeare identifier is used later in the code to attach relationships to the underlying node. For example, (shakespeare)-[:WROTE_PLAY {year:1599}]->(juliusCaesar) creates a WROTE_PLAY relationship from Shakespeare to the play Julius Caesar. This relationship has a year property with value 1599.

Identifiers remain available for the duration of the current query scope, but no longer. Should we wish to give long-lived names to nodes, we simply create an index for a particular label and key property combination. We discuss indexes in “Indexes and Constraints”.
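
In the Neo4j version current as this edition went to press, such an index (or a uniqueness constraint, which is also index-backed) is declared per label-and-property combination. A sketch:

```cypher
// Index lookups of venues by name
CREATE INDEX ON :Venue(name)

// Guarantee at most one node per country name
CREATE CONSTRAINT ON (country:Country) ASSERT country.name IS UNIQUE
```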


Note

Unlike the relational model, these commands don’t introduce any accidental complexity into the graph. The information meta-model — that is, the structuring of nodes through labels and relationships — is kept separate from the business data, which lives exclusively as properties. We no longer have to worry about foreign key and cardinality constraints polluting our real data, because both are explicit in the graph model in the form of nodes and the semantically rich relationships that interconnect them.


We can modify the graph at a later point in time in two different ways. We can, of course, continue using CREATE statements to simply add to the graph. But we can also use MERGE, which has the semantics of ensuring that a particular subgraph structure of nodes and relationships — some of which may already exist, some of which may be missing — is in place once the command has executed. In practice, we tend to use CREATE when we’re adding to the graph and don’t mind duplication, and MERGE when duplication is not permitted by the domain.
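
For example, the following sketch can be run any number of times without ever duplicating its nodes or relationship (the play Hamlet is our own illustrative addition to the sample graph):

```cypher
MERGE (bard:Author {lastname:'Shakespeare'})
MERGE (hamlet:Play {title:'Hamlet'})
MERGE (bard)-[:WROTE_PLAY]->(hamlet)
```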

Beginning a Query

Now that we have a graph, we can start to query it. In Cypher we always begin our queries from one or more well-known starting points in the graph — what are called bound nodes. Cypher uses any labels and property predicates supplied in the MATCH and WHERE clauses, together with metadata supplied by indexes and constraints, to find the starting points that anchor our graph patterns.

For instance, if we wanted to discover more about performances at the Theatre Royal, we’d start our query from the Theatre Royal node, which we find by specifying its Venue label and name property. If, however, we were more interested in a person’s reviews, we’d use that person’s node as a starting point for our query, matching on the User label and name property combination.

Let’s assume we want to find out about all the Shakespeare events that have taken place in the Theatre Royal in Newcastle. These three things — an Author named Shakespeare, a Venue called Theatre Royal, and a City with the name Newcastle — provide the starting points for our new query:

MATCH (theater:Venue {name:'Theatre Royal'}),
      (newcastle:City {name:'Newcastle'}),
      (bard:Author {lastname:'Shakespeare'})

This MATCH clause identifies all Venue nodes with a property key name and property value Theatre Royal and binds them to the identifier theater. (What if there are many Theatre Royal nodes in this graph? We’ll deal with that shortly.) As the next step, we find the node representing the City of Newcastle; we bind this node to the identifier newcastle. Finally, as with our earlier Shakespeare query, to find the Shakespeare node itself, we look for a node with the label Author and a lastname property whose value is Shakespeare. We bind the result of this lookup to bard.

From now on in our query, wherever we use the identifiers theater, newcastle, and bard in a pattern, that pattern will be anchored to the real nodes associated with these three identifiers. In effect, this information binds the query to a specific part of the graph, giving us starting points from which to match patterns in the immediately surrounding nodes and relationships.

Declaring Information Patterns to Find

The MATCH clause in Cypher is where the magic happens. Just as the CREATE clause uses ASCII art to convey intent and describe the desired state of the graph, so the MATCH clause uses the same syntax to describe patterns to discover in the database. We’ve already looked at a very simple MATCH clause; now we’ll look at a more complex pattern that finds all the Shakespeare performances at Newcastle’s Theatre Royal:

MATCH (theater:Venue {name:'Theatre Royal'}),
      (newcastle:City {name:'Newcastle'}),
      (bard:Author {lastname:'Shakespeare'}),
      (newcastle)<-[:STREET|CITY*1..2]-(theater)
        <-[:VENUE]-()-[:PERFORMANCE_OF]->()
        -[:PRODUCTION_OF]->(play)<-[:WROTE_PLAY]-(bard)
RETURN DISTINCT play.title AS play

MATCH模式使用了我们尚未遇到的几个句法元素。除了我们之前讨论过的锚定节点外,它还使用了模式节点、任意深度路径和匿名节点。让我们依次看一下这些元素:

This MATCH pattern uses several syntactic elements we’ve not yet come across. As well as anchored nodes that we discussed earlier, it uses pattern nodes, arbitrary depth paths, and anonymous nodes. Let’s take a look at each of these in turn:

  • The identifiers newcastle, theater, and bard are anchored to real nodes in the graph based on the specified label and property values.
  • If there are several Theatre Royals in our database (the British cities of Plymouth, Bath, Winchester, and Norwich all have a Theatre Royal, for example), then theater will be bound to all these nodes. To restrict our pattern to the Theatre Royal in Newcastle, we use the syntax <-[:STREET|CITY*1..2]-, which means the theater node can be no more than two outgoing STREET and/or CITY relationships away from the node representing the city of Newcastle-upon-Tyne. By providing a variable depth path, we allow for relatively fine-grained address hierarchies (comprising, for example, street, district or borough, and city).
  • The syntax (theater)<-[:VENUE]-() uses the anonymous node, hence the empty parentheses. Knowing the data as we do, we expect the anonymous node to match performances, but because we’re not interested in using the details of individual performances elsewhere in the query or in the results, we don’t name the node or bind it to an identifier.
  • We use the anonymous node again to link the performance to the production (()-[:PERFORMANCE_OF]->()). If we were interested in returning details of performances and productions, we would replace these occurrences of the anonymous node with identifiers: (performance)-[:PERFORMANCE_OF]->(production).
  • The remainder of the MATCH is a straightforward (play)<-[:WROTE_PLAY]-(bard) node-to-relationship-to-node pattern match. This pattern ensures that we only return plays written by Shakespeare. Because (play) is joined to the anonymous production node, and by way of that to the performance node, we can safely infer that it has been performed in Newcastle’s Theatre Royal. In naming the play node we bring it into scope so that we can use it later in the query.
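
Assembled from these elements, the full pattern reads as a single query (reproduced here in one piece so the whole shape can be seen at once):

```cypher
MATCH (theater:Venue {name:'Theatre Royal'}),
      (newcastle:City {name:'Newcastle'}),
      (bard:Author {lastname:'Shakespeare'}),
      (newcastle)<-[:STREET|CITY*1..2]-(theater)
        <-[:VENUE]-()-[:PERFORMANCE_OF]->()
        -[:PRODUCTION_OF]->(play)<-[:WROTE_PLAY]-(bard)
RETURN DISTINCT play.title AS play
```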

Running this query returns all the Shakespeare plays that have been performed at the Theatre Royal in Newcastle:

+-----------------+
| play            |
+-----------------+
| "Julius Caesar" |
| "The Tempest"   |
+-----------------+
2 rows

This is great if we’re interested in the entire history of Shakespeare at the Theatre Royal, but if we’re interested only in specific plays, productions, or performances, we need somehow to constrain the set of results.

Constraining Matches

We constrain graph matches using the WHERE clause. WHERE allows us to eliminate matched subgraphs from the results by stipulating one or more of the following:

  • That certain paths must be present (or absent) in the matched subgraphs.
  • That nodes must have certain labels or relationships with certain names.
  • That specific properties on matched nodes and relationships must be present (or absent), irrespective of their values.
  • That certain properties on matched nodes and relationships must have specific values.
  • That other predicates must be satisfied (e.g., that performances must have occurred on or before a certain date).
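
The following sketch combines several of these predicate types in one hypothetical query against our theater data. The `Play` label and the no-production check are illustrative assumptions, not part of the sample dataset:

```cypher
MATCH (bard:Author {lastname:'Shakespeare'})-[w:WROTE_PLAY]->(play)
WHERE play:Play                        // node must carry a certain label
  AND NOT (play)<-[:PRODUCTION_OF]-()  // a certain path must be absent
  AND w.year <= 1608                   // a property must satisfy a predicate
RETURN play.title AS play
```

This would match early- and middle-period Shakespeare plays for which no production has been recorded.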

Compared to the MATCH clause, which describes structural relationships and assigns identifiers to parts of the pattern, WHERE constrains the current pattern match. Let’s imagine, for example, that we want to restrict the range of plays in our results to those from Shakespeare’s final period, which is generally accepted to have begun around 1608. We do this by filtering on the year property of matched WROTE_PLAY relationships. To enable this filtering, we adjust the MATCH clause, binding the WROTE_PLAY relationship to an identifier, which we’ll call w (relationship identifiers come before the colon that prefixes a relationship’s name). We then add a WHERE clause that filters on this relationship’s year property:

MATCH (theater:Venue {name:'Theatre Royal'}),
      (newcastle:City {name:'Newcastle'}),
      (bard:Author {lastname:'Shakespeare'}),
      (newcastle)<-[:STREET|CITY*1..2]-(theater)
        <-[:VENUE]-()-[:PERFORMANCE_OF]->()
        -[:PRODUCTION_OF]->(play)<-[w:WROTE_PLAY]-(bard)
WHERE w.year > 1608
RETURN DISTINCT play.title AS play

Adding this WHERE clause means that for each successful match, the database checks that the WROTE_PLAY relationship between the Shakespeare node and the matched play has a year property with a value greater than 1608. Matches with a WROTE_PLAY relationship whose year value is greater than 1608 will pass the test; these plays will then be included in the results. Matches that fail the test will not be included in the results. By adding this clause, we ensure that only plays from Shakespeare’s late period are returned:

+---------------+
| play          |
+---------------+
| "The Tempest" |
+---------------+
1 row

Processing Results

Cypher’s RETURN clause allows us to perform some processing on the matched graph data before returning it to the user (or the application) that executed the query.

As we’ve seen in the previous queries, the simplest thing we can do is return the plays we’ve found:

RETURN DISTINCT play.title AS play

DISTINCT ensures that we return unique results. Because each play can be performed multiple times in the same theater, sometimes in different productions, we can end up with duplicate play titles. DISTINCT filters these out.

We can enrich this result in several ways, including aggregating, ordering, filtering, and limiting the returned data. For example, if we’re only interested in the number of plays that match our criteria, we apply the count function:

RETURN count(play)

If we want to rank the plays by the number of performances, we’ll need first to bind the PERFORMANCE_OF relationship in the MATCH clause to an identifier, called p, which we can then count and order:

MATCH (theater:Venue {name:'Theatre Royal'}),
      (newcastle:City {name:'Newcastle'}),
      (bard:Author {lastname:'Shakespeare'}),
      (newcastle)<-[:STREET|CITY*1..2]-(theater)
        <-[:VENUE]-()-[p:PERFORMANCE_OF]->()
        -[:PRODUCTION_OF]->(play)<-[:WROTE_PLAY]-(bard)
RETURN   play.title AS play, count(p) AS performance_count
ORDER BY performance_count DESC

The RETURN clause here counts the number of PERFORMANCE_OF relationships using the identifier p (which is bound to the PERFORMANCE_OF relationships in the MATCH clause) and aliases the result as performance_count. It then orders the results based on performance_count, with the most frequently performed play listed first:

+-------------------------------------+
| play            | performance_count |
+-------------------------------------+
| "Julius Caesar" | 2                 |
| "The Tempest"   | 1                 |
+-------------------------------------+
2 rows

Query Chaining

Before we leave our brief tour of Cypher, there is one more feature that is useful to know about — the WITH clause. Sometimes it’s just not practical (or possible) to do everything you want in a single MATCH. The WITH clause allows us to chain together several matches, with the results of the previous query part being piped into the next. In the following example we find the plays written by Shakespeare, and order them based on the year in which they were written, latest first. Using WITH, we then pipe the results to the RETURN clause, which uses the collect function to produce a comma-delimited list of play titles:

MATCH (bard:Author {lastname:'Shakespeare'})-[w:WROTE_PLAY]->(play)
WITH play
ORDER BY w.year DESC
RETURN collect(play.title) AS plays

Executing this query against our sample graph produces the following result:

+---------------------------------+
| plays                           |
+---------------------------------+
| ["The Tempest","Julius Caesar"] |
+---------------------------------+
1 row

WITH can be used to separate read-only clauses from write-centric SET operations. More generally, WITH helps divide and conquer complex queries by allowing us to break a single complex query into several simpler patterns.
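
As a sketch of that read/write separation (the period property is hypothetical, not part of our sample data):

```cypher
MATCH (bard:Author {lastname:'Shakespeare'})-[w:WROTE_PLAY]->(play)
WHERE w.year > 1608
WITH play                 // read-only matching ends here
SET play.period = 'late'  // write-centric phase begins
```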

Common Modeling Pitfalls

Although graph modeling is a very expressive way of mastering the complexity in a problem domain, expressivity alone is no guarantee that a particular graph is fit for purpose. In fact, there have been occasions where even those of us who work with graphs every day make mistakes. In this section we’ll take a look at a model that went wrong. In so doing, we’ll learn how to identify problems early in the modeling effort, and how to fix them.

Email Provenance Problem Domain

This example involves the analysis of email communications. Communication pattern analysis is a classic graph problem that involves interrogating the graph to discover subject matter experts, key influencers, and the communication channels through which information is propagated. On this occasion, however, instead of looking for positive role models (in the form of experts), we were searching for rogues: that is, suspicious patterns of email communication that fall foul of corporate governance — or even break the law.

A Sensible First Iteration?

In analyzing the domain we learned about all the clever patterns that potential wrong-doers adopt to cover their tracks: using blind-copying (BCC), using aliases — even conducting conversations with those aliases to mimic legitimate interactions between real business stakeholders. Based on this analysis we produced a representative graph model that seemed to capture all the relevant entities and their activities.

To illustrate this early model, we’ll use Cypher’s CREATE clause to generate some nodes representing users and aliases. We’ll also generate a relationship that shows that Alice is one of Bob’s known aliases. (We’ll assume that the underlying graph database is indexing these nodes so that we can later look them up and use them as starting points in our queries.) Here’s the Cypher query to create our first graph:

CREATE (alice:User {username:'Alice'}),
       (bob:User {username:'Bob'}),
       (charlie:User {username:'Charlie'}),
       (davina:User {username:'Davina'}),
       (edward:User {username:'Edward'}),
       (alice)-[:ALIAS_OF]->(bob)

The resulting graph model makes it easy to observe that Alice is an alias of Bob, as we see in Figure 3-7.

Figure 3-7. Users and aliases

Now we join the users together through the emails they’ve exchanged:

MATCH  (bob:User {username:'Bob'}),
       (charlie:User {username:'Charlie'}),
       (davina:User {username:'Davina'}),
       (edward:User {username:'Edward'})
CREATE (bob)-[:EMAILED]->(charlie),
       (bob)-[:CC]->(davina),
       (bob)-[:BCC]->(edward)

At first sight this looks like a reasonably faithful representation of the domain. Each clause lends itself to being read comfortably left to right, thereby passing one of our informal tests for correctness. For example, we can see from the graph that “Bob emailed Charlie.” The limitations of this model only emerge when it becomes necessary to determine exactly what was exchanged by the potential miscreant Bob (and his alter ego Alice). We can see that Bob CC’d or BCC’d some people, but we can’t see the most important thing of all: the email itself.

This first modeling attempt results in a star-shaped graph with Bob at the center. His actions of emailing, copying, and blind-copying are represented by relationships that extend from Bob to the nodes representing the recipients of his mail. As we see in Figure 3-8, however, the most critical element of the data, the actual email, is missing.

Figure 3-8. A missing email node leads to lost information

This graph structure is lossy, a fact that becomes evident when we pose the following query:

MATCH (bob:User {username:'Bob'})-[e:EMAILED]->
      (charlie:User {username:'Charlie'})
RETURN e

This query returns the EMAILED relationships between Bob and Charlie (there will likely be one for each email that Bob has sent to Charlie). This tells us that emails have been exchanged, but it tells us nothing about the emails themselves:

+----------------+
| e              |
+----------------+
| :EMAILED[1] {} |
+----------------+
1 row

We might think we can remedy the situation by adding properties to the EMAILED relationship to represent an email’s attributes, but that’s just playing for time. Even with properties attached to each EMAILED relationship, we would still be unable to correlate between the EMAILED, CC, and BCC relationships; that is, we would be unable to say which emails were copied versus which were blind-copied, and to whom.
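
That stop-gap would look something like the following sketch, shown only to illustrate the flaw (the property values are placeholders):

```cypher
CREATE (bob)-[:EMAILED {content:'...', date:'...'}]->(charlie),
       (bob)-[:CC]->(davina),
       (bob)-[:BCC]->(edward)
// Nothing ties the CC and BCC relationships to this particular
// EMAILED relationship, so the correlation is still lost.
```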

The fact is we’ve unwittingly made a simple modeling mistake, caused mostly by a lax use of English rather than any shortcomings of graph theory. Our everyday use of language has led us to focus on the verb “emailed” rather than the email itself, and as a result we’ve produced a model lacking true domain insight.

In English, it’s easy and convenient to shorten the phrase “Bob sent an email to Charlie” to “Bob emailed Charlie.” In most cases, that loss of a noun (the actual email) doesn’t matter because the intent is still clear. But when it comes to our forensics scenario, these elided statements are problematic. The intent remains the same, but the details of the number, contents, and recipients of the emails that Bob sent have been lost through having been folded into a relationship EMAILED, rather than being modeled explicitly as nodes in their own right.

Second Time’s the Charm

To fix our lossy model, we need to insert email nodes to represent the real emails exchanged within the business, and expand our set of relationship names to encompass the full set of addressing fields that email supports. Now instead of creating lossy structures like this:

CREATE (bob)-[:EMAILED]->(charlie)

we’ll instead create more detailed structures, like this:

CREATE (email_1:Email {id:'1', content:'Hi Charlie, ... Kind regards, Bob'}),
       (bob)-[:SENT]->(email_1),
       (email_1)-[:TO]->(charlie),
       (email_1)-[:CC]->(davina),
       (email_1)-[:CC]->(alice),
       (email_1)-[:BCC]->(edward)

This results in another star-shaped graph structure, but this time the email is at the center, as we see in Figure 3-9.

Figure 3-9. Star graph based on an email

Of course, in a real system there will be many more of these emails, each with its own intricate web of interactions for us to explore. It’s quite easy to imagine that over time many more CREATE statements are executed as the email server logs the interactions, like so (we’ve omitted anchor nodes for brevity):

CREATE (email_1:Email {id:'1', content:'email contents'}),
       (bob)-[:SENT]->(email_1),
       (email_1)-[:TO]->(charlie),
       (email_1)-[:CC]->(davina),
       (email_1)-[:CC]->(alice),
       (email_1)-[:BCC]->(edward);

CREATE (email_2:Email {id:'2', content:'email contents'}),
       (bob)-[:SENT]->(email_2),
       (email_2)-[:TO]->(davina),
       (email_2)-[:BCC]->(edward);

CREATE (email_3:Email {id:'3', content:'email contents'}),
       (davina)-[:SENT]->(email_3),
       (email_3)-[:TO]->(bob),
       (email_3)-[:CC]->(edward);

CREATE (email_4:Email {id:'4', content:'email contents'}),
       (charlie)-[:SENT]->(email_4),
       (email_4)-[:TO]->(bob),
       (email_4)-[:TO]->(davina),
       (email_4)-[:TO]->(edward);

CREATE (email_5:Email {id:'5', content:'email contents'}),
       (davina)-[:SENT]->(email_5),
       (email_5)-[:TO]->(alice),
       (email_5)-[:BCC]->(bob),
       (email_5)-[:BCC]->(edward);

This leads to the more complex, and interesting, graph we see in Figure 3-10.

Figure 3-10. A graph of email interactions

We can now query this graph to identify potentially suspect behavior:

MATCH (bob:User {username:'Bob'})-[:SENT]->(email)-[:CC]->(alias),
      (alias)-[:ALIAS_OF]->(bob)
RETURN email.id

Here we retrieve all the emails that Bob has sent where he’s CC’d one of his own aliases. Any emails that match this pattern are indicative of rogue behavior. And because both Cypher and the underlying graph database have graph affinity, these queries — even over large datasets — run very quickly. This query returns the following results:

+------------------------------------------+
| email                                    |
+------------------------------------------+
| Node[6]{id:"1",content:"email contents"} |
+------------------------------------------+
1 row

Evolving the Domain

As with any database, our graph serves a system that is likely to evolve over time. So what should we do when the graph evolves? How do we know what breaks, or indeed, how do we even tell that something has broken? The fact is, we can’t completely avoid migrations in a graph database: they’re a fact of life, just as with any data store. But in a graph database they’re usually a lot simpler.

In a graph, to add new facts or compositions, we tend to add new nodes and relationships rather than change the model in place. Adding to the graph using new kinds of relationships will not affect any existing queries, and is completely safe. Changing the graph using existing relationship types, and changing the properties (not just the property values) of existing nodes might be safe, but we need to run a representative set of queries to maintain confidence that the graph is still fit for purpose after the structural changes. However, these activities are precisely the same kinds of actions we perform during normal database operation, so in a graph world a migration really is just business as usual.
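
For example, recording a new kind of fact with a brand-new relationship type is purely additive. The WORKS_WITH relationship below is invented for illustration; queries that match only the existing SENT, TO, CC, and BCC relationships continue to work unchanged:

```cypher
MATCH (bob:User {username:'Bob'}), (charlie:User {username:'Charlie'})
CREATE (bob)-[:WORKS_WITH]->(charlie)
```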

At this point we have a graph that describes who sent and received emails, as well as the content of the emails themselves. But of course, one of the joys of email is that recipients can forward on or reply to an email they’ve received. This increases interaction and knowledge sharing, but in some cases leaks critical business information. Since we’re looking for suspicious communication patterns, it makes sense for us to also take into account forwarding and replies.

At first glance, there would appear to be no need to use database migrations to update our graph to support our new use case. The simplest additions we can make involve adding FORWARDED and REPLIED_TO relationships to the graph, as shown in Figure 3-11. Doing so won’t affect any preexisting queries because they aren’t coded to recognize the new relationships.

Figure 3-11. A naive, lossy approach fails to recognize that forwarded and replied-to emails are first-class entities

However, this approach quickly proves inadequate. Adding FORWARDED or REPLIED relationships is naive and lossy in much the same way as our original use of an EMAILED relationship. To illustrate this, consider the following CREATE statements:

...
MATCH  (email:Email {id:'1234'})
CREATE (alice)-[:REPLIED_TO]->(email)
CREATE (davina)-[:FORWARDED]->(email)-[:TO]->(charlie)

In the first CREATE statement we’re trying to record the fact that Alice replied to a particular email. The statement makes logical sense when read from left to right, but the sentiment is lossy — we can’t tell whether Alice replied to all the recipients of email or directly to the author. All we know is that some reply was sent. The second statement also reads well from left to right: Davina forwarded email to Charlie. But we already use the TO relationship to indicate that a given email has a TO header identifying the primary recipients. Reusing TO here makes it impossible to tell who was a recipient and who received a forwarded version of an email.

To resolve this problem, we have to consider the fundamentals of the domain. A reply to an email is itself a new Email, but it is also a Reply. In other words, the reply has two roles, which in the graph can be represented by attaching two labels, Email and Reply, to the reply node. Whether the reply is to the original sender, all recipients, or a subset can be easily modeled using the same familiar TO, CC, and BCC relationships, while the original email itself can be referenced via a REPLY_TO relationship. Here’s a revised series of writes resulting from several email actions (again, we’ve omitted the necessary anchoring of nodes):

CREATE (email_6:Email {id:'6', content:'email'}),
       (bob)-[:SENT]->(email_6),
       (email_6)-[:TO]->(charlie),
       (email_6)-[:TO]->(davina);

CREATE (reply_1:Email:Reply {id:'7', content:'response'}),
       (reply_1)-[:REPLY_TO]->(email_6),
       (davina)-[:SENT]->(reply_1),
       (reply_1)-[:TO]->(bob),
       (reply_1)-[:TO]->(charlie);

CREATE (reply_2:Email:Reply {id:'8', content:'response'}),
       (reply_2)-[:REPLY_TO]->(email_6),
       (bob)-[:SENT]->(reply_2),
       (reply_2)-[:TO]->(davina),
       (reply_2)-[:TO]->(charlie),
       (reply_2)-[:CC]->(alice);

CREATE (reply_3:Email:Reply {id:'9', content:'response'}),
       (reply_3)-[:REPLY_TO]->(reply_1),
       (charlie)-[:SENT]->(reply_3),
       (reply_3)-[:TO]->(bob),
       (reply_3)-[:TO]->(davina);

CREATE (reply_4:Email:Reply {id:'10', content:'response'}),
       (reply_4)-[:REPLY_TO]->(reply_3),
       (bob)-[:SENT]->(reply_4),
       (reply_4)-[:TO]->(charlie),
       (reply_4)-[:TO]->(davina);

This creates the graph in Figure 3-12, which shows numerous replies and replies-to-replies.

Now it is easy to see who replied to Bob’s original email. First, locate the email of interest, then match against all incoming REPLY_TO relationships (there may be multiple replies), and from there match against incoming SENT relationships: this reveals the sender(s). In Cypher this is simple to express. In fact, Cypher makes it easy to look for replies-to-replies-to-replies, and so on to an arbitrary depth (though we limit it to depth four here):

MATCH p=(email:Email {id:'6'})<-[:REPLY_TO*1..4]-(:Reply)<-[:SENT]-(replier)
RETURN replier.username AS replier, length(p) - 1 AS depth
ORDER BY depth

Figure 3-12. Explicitly modeling replies in high fidelity

Here we capture each matched path, binding it to the identifier p. In the RETURN clause we then calculate the length of the reply-to chain (subtracting 1 for the SENT relationship), and return the replier’s name and the depth at which he or she replied. This query returns the following results:

+-------------------+
| replier   | depth |
+-------------------+
| "Davina"  | 1     |
| "Bob"     | 1     |
| "Charlie" | 2     |
| "Bob"     | 3     |
+-------------------+
4 rows

We see that both Davina and Bob replied directly to Bob’s original email; that Charlie replied to one of the replies; and that Bob then replied to one of the replies to a reply.
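
If we were interested only in direct replies, the variable-depth range would collapse to a single REPLY_TO hop:

```cypher
MATCH (email:Email {id:'6'})<-[:REPLY_TO]-(:Reply)<-[:SENT]-(replier)
RETURN DISTINCT replier.username AS replier
```

Against the sample data this matches just the depth-1 repliers, Davina and Bob.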

It’s a similar pattern for a forwarded email, which can be regarded as a new email that simply happens to contain some of the text of the original email. As with the reply case, we model the new email explicitly. We also reference the original email from the forwarded mail so that we always have detailed and accurate provenance data. The same applies if the forwarded mail is itself forwarded, and so on. For example, if Alice (Bob’s alter ego) emails Bob to try to establish separate concrete identities, and then Bob (wishing to perform some subterfuge) forwards that email on to Charlie, who then forwards it onto Davina, we actually have three emails to consider. Assuming the users (and their aliases) are already in the database, in Cypher we’d write that audit information into the database as follows:

CREATE (email_11:Email {id:'11', content:'email'}),
       (alice)-[:SENT]->(email_11)-[:TO]->(bob);

CREATE (email_12:Email:Forward {id:'12', content:'email'}),
       (email_12)-[:FORWARD_OF]->(email_11),
       (bob)-[:SENT]->(email_12)-[:TO]->(charlie);

CREATE (email_13:Email:Forward {id:'13', content:'email'}),
       (email_13)-[:FORWARD_OF]->(email_12),
       (charlie)-[:SENT]->(email_13)-[:TO]->(davina);

On completion of these writes, our database will contain the subgraph shown in Figure 3-13.

Figure 3-13. Explicitly modeling email forwarding

Using this graph, we can determine the various paths of a forwarded email chain.

MATCH (email:Email {id:'11'})<-[f:FORWARD_OF*]-(:Forward)
RETURN count(f)

This query starts at the given email and then matches against all incoming FORWARD_OF relationships in the tree of forwarded emails to any depth. These relationships are bound to an identifier f. To calculate the number of times the email has been forwarded, we count the number of FORWARD_OF relationships bound to f using Cypher’s count function. In this example, we see the original email has been forwarded twice:

+----------+
| count(f) |
+----------+
| 2        |
+----------+
1 row
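
To see not only how many times the email was forwarded, but who forwarded it at each step, we can bind the whole forward chain as a path and inspect its length. The following is a sketch of that variation (the `username` property is an assumption here, not shown in the preceding examples):

```cypher
// Bind each forward chain as a path, then pair every forward
// with the user who sent it
MATCH p = (:Email {id:'11'})<-[:FORWARD_OF*]-(forward:Forward)
MATCH (forwarder)-[:SENT]->(forward)
RETURN forwarder.username AS forwarder, length(p) AS depth
ORDER BY depth
```

Because `length(p)` counts the FORWARD_OF hops back to the original email, each row pairs a forwarder with that email's position in the chain.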

Identifying Nodes and Relationships

The modeling process can best be summed up as an attempt to create a graph structure that expresses the questions we want to ask of our domain. That is, design for queryability:

  1. Describe the client or end-user goals that motivate our model.
  2. Rewrite these goals as questions to ask of our domain.
  3. Identify the entities and the relationships that appear in these questions.
  4. Translate these entities and relationships into Cypher path expressions.
  5. Express the questions we want to ask of our domain as graph patterns using path expressions similar to the ones we used to model the domain.

By examining the language we use to describe our domain, we can very quickly identify the core elements in our graph:

  • Common nouns become labels: “user” and “email,” for example, become the labels User and Email.
  • Verbs that take an object become relationship names: “sent” and “wrote,” for example, become SENT and WROTE.
  • A proper noun — a person or company’s name, for example — refers to an instance of a thing, which we model as a node, using one or more properties to capture that thing’s attributes.
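
As a sketch of this mapping, a sentence such as “Bob sent an email to Charlie” translates almost word for word into a Cypher path (the property names and values here are illustrative):

```cypher
// Nouns become labeled nodes; the verb becomes a relationship
CREATE (bob:User {username:'Bob'})-[:SENT]->
       (:Email {id:'1', content:'...'})-[:TO]->
       (charlie:User {username:'Charlie'})
```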

Avoiding Anti-Patterns

In the general case, don’t encode entities into relationships. Use relationships to convey semantics about how entities are related, and the quality of those relationships. Domain entities aren’t always immediately visible in speech, so we must think carefully about the nouns we’re actually dealing with. Verbing, the language habit whereby a noun is transformed into a verb, can often hide the presence of a noun and a corresponding domain entity. Technical and business jargon is particularly rife with such neologisms: as we’ve seen, we “email” one another, rather than send an email, “google” for results, rather than search Google.
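
As a sketch of the difference, compare a model that hides the email inside a verbed relationship with one that surfaces it as a node (assuming `bob` and `charlie` are already bound by an earlier MATCH):

```cypher
// Anti-pattern: the email is buried in the relationship name,
// leaving nowhere to attach content, timestamps, or CC recipients
CREATE (bob)-[:EMAILED]->(charlie)

// Preferred: the email is an entity in its own right
CREATE (bob)-[:SENT]->(:Email {id:'1', content:'...'})-[:TO]->(charlie)
```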

It’s also important to realize that graphs are a naturally additive structure. It’s quite natural to add facts in terms of domain entities and how they interrelate using new nodes and new relationships, even if it feels like we’re flooding the database with a great deal of data. In general, it’s bad practice to try to conflate data elements at write time to preserve query-time efficiency. If we model in accordance with the questions we want to ask of our data, an accurate representation of the domain will emerge. With this data model in place, we can trust the graph database to perform well at read time.


Note

Graph databases maintain fast query times even when storing vast amounts of data. Learning to trust our graph database is important when learning to structure our graphs without denormalizing them.


Summary

Graph databases give software professionals the power to represent a problem domain using a graph, and then persist and query that graph at runtime. We can use graphs to clearly describe a problem domain; graph databases then allow us to store this representation in a way that maintains high affinity between the domain and the data. Further, graph modeling removes the need to normalize and denormalize data using complex data management code.

Many of us, however, will be new to modeling with graphs. The graphs we create should read well for queries, while avoiding conflating entities and actions — bad practices that can lose useful domain knowledge. Although there are no absolute rights or wrongs to graph modeling, the guidance in this chapter will help you create graph data that can serve your systems’ needs over many iterations, all the while keeping pace with code evolution.

Armed with an understanding of graph data modeling, you may now be considering undertaking a graph database project. In the next chapter we’ll look at what’s involved in planning and delivering a graph database solution.

1 For reference documentation, see http://goo.gl/W7Jh6x and http://goo.gl/ftv8Gx.

Chapter 4. Building a Graph Database Application

In this chapter, we discuss some of the practical issues of working with a graph database. In previous chapters, we’ve looked at graph data; in this chapter, we’ll apply that knowledge in the context of developing a graph database application. We’ll look at some of the data modeling questions that may arise, and at some of the application architecture choices available to us.

In our experience, graph database applications are highly amenable to being developed using the evolutionary, incremental, and iterative software development practices in widespread use today. A key feature of these practices is the prevalence of testing throughout the software development life cycle. Here we’ll show how we develop our data model and our application in a test-driven fashion.

At the end of the chapter, we’ll look at some of the issues we’ll need to consider when planning for production.

Data Modeling

We covered modeling and working with graph data in detail in Chapter 3. Here we summarize some of the more important modeling guidelines, and discuss how implementing a graph data model fits with iterative and incremental software development techniques.

Describe the Model in Terms of the Application’s Needs

The questions we need to ask of the data help identify entities and relationships. Agile user stories provide a concise means for expressing an outside-in, user-centered view of an application’s needs, and the questions that arise in the course of satisfying this need.1 Here’s an example of a user story for a book review web application:

AS A reader who likes a book, I WANT to know which books other readers who like the same book have liked, SO THAT I can find other books to read.

This story expresses a user need, which motivates the shape and content of our data model. From a data modeling point of view, the AS A clause establishes a context comprising two entities — a reader and a book — plus the LIKES relationship that connects them. The I WANT clause then poses a question: which books have the readers who like the book I’m currently reading also liked? This question exposes more LIKES relationships, and more entities: other readers and other books.

The entities and relationships that we’ve surfaced in analyzing the user story quickly translate into a simple data model, as shown in Figure 4-1.

Figure 4-1. Data model for the book review user story

Because this data model directly encodes the question presented by the user story, it lends itself to being queried in a way that similarly reflects the structure of the question we want to ask of the data, since Alice likes Dune, find books that others who like Dune have enjoyed:

MATCH (:Reader {name:'Alice'})-[:LIKES]->(:Book {title:'Dune'})
      <-[:LIKES]-(:Reader)-[:LIKES]->(books:Book)
RETURN books.title

Nodes for Things, Relationships for Structure

Though not applicable in every situation, these general guidelines will help us choose when to use nodes, and when to use relationships:

  • Use nodes to represent entities — that is, the things in our domain that are of interest to us, and which can be labeled and grouped.
  • Use relationships both to express the connections between entities and to establish semantic context for each entity, thereby structuring the domain.
  • Use relationship direction to further clarify relationship semantics. Many relationships are asymmetrical, which is why relationships in a property graph are always directed. For bidirectional relationships, we should make our queries ignore direction, rather than using two relationships.
  • Use node properties to represent entity attributes, plus any necessary entity metadata, such as timestamps, version numbers, etc.
  • Use relationship properties to express the strength, weight, or quality of a relationship, plus any necessary relationship metadata, such as timestamps, version numbers, etc.
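
To illustrate the guideline on direction, a single relationship can serve a symmetrical association if our queries simply omit the arrowhead. This is a sketch; the PARTNER relationship name and properties are illustrative:

```cypher
// Store the symmetrical association once, with an arbitrary direction
CREATE (:User {name:'Alice'})-[:PARTNER]->(:User {name:'Bob'})

// Match it from either end by leaving out the arrowhead
MATCH (:User {name:'Bob'})-[:PARTNER]-(partner:User)
RETURN partner.name
```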

It pays to be diligent about discovering and capturing domain entities. As we saw in Chapter 3, it’s relatively easy to model things that really ought to be represented as nodes using carelessly named relationships instead. If we’re tempted to use a relationship to model an entity — an email, or a review, for example — we must make certain that this entity cannot be related to more than two other entities. Remember, a relationship must have a start node and an end node — nothing more, nothing less. If we find later that we need to connect something we’ve modeled as a relationship to more than two other entities, we’ll have to refactor the entity inside the relationship out into a separate node. This is a breaking change to the data model, and will likely require us to make changes to any queries and application code that produce or consume the data.

Fine-Grained versus Generic Relationships

When designing relationships we should be mindful of the trade-offs between using fine-grained relationship names versus generic relationships qualified with properties. It’s the difference between using DELIVERY_ADDRESS and HOME_ADDRESS versus ADDRESS {type:'delivery'} and ADDRESS {type:'home'}.

Relationships are the royal road into the graph. Differentiating by relationship name is the best way of eliminating large swathes of the graph from a traversal. Using one or more property values to decide whether or not to follow a relationship incurs extra I/O the first time those properties are accessed because the properties reside in a separate store file from the relationships (after that, however, they’re cached).

We use fine-grained relationships whenever we have a closed set of relationship names. Weightings — as required by a shortest-weighted-path algorithm — rarely comprise a closed set, and are usually best represented as properties on relationships.

Sometimes, however, we have a closed set of relationships, but in some traversals we want to follow specific kinds of relationships within that set, whereas in others we want to follow all of them, irrespective of type. Addresses are a good example. Following the closed-set principle, we might choose to create HOME_ADDRESS, WORK_ADDRESS, and DELIVERY_ADDRESS relationships. This allows us to follow specific kinds of address relationships (DELIVERY_ADDRESS, for example) while ignoring all the rest. But what do we do if we want to find all addresses for a user? There are a couple of options here. First, we can encode knowledge of all the different relationship types in our queries: e.g., MATCH (user)-[:HOME_ADDRESS|WORK_ADDRESS|DELIVERY_ADDRESS]->(address). This, however, quickly becomes unwieldy when there are lots of different kinds of relationships. Alternatively, we can add a more generic ADDRESS relationship to our model, in addition to the fine-grained relationships. Every node representing an address is then connected to a user using two relationships: a fine-grained relationship (e.g., DELIVERY_ADDRESS) and the more generic ADDRESS {type:'delivery'} relationship.
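
Connecting an address with both kinds of relationship might look like the following sketch (the property names and values are illustrative):

```cypher
// Each address node gets a fine-grained relationship for targeted
// traversals, plus a generic ADDRESS relationship for "all addresses"
MATCH (user:User {id:'u1'})
CREATE (addr:Address {first_line:'108 Main Street', zipcode:'37932'}),
       (user)-[:DELIVERY_ADDRESS]->(addr),
       (user)-[:ADDRESS {type:'delivery'}]->(addr)
```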

As we discussed in “Describe the Model in Terms of the Application’s Needs”, the key here is to let the questions we want to ask of our data guide the kinds of relationships we introduce into the model.

Model Facts as Nodes

When two or more domain entities interact for a period of time, a fact emerges. We represent a fact as a separate node with connections to each of the entities engaged in that fact. Modeling an action in terms of its product — that is, in terms of the thing that results from the action — produces a similar structure: an intermediate node that represents the outcome of an interaction between two or more entities. We can use timestamp properties on this intermediate node to represent start and end times.

The following examples show how we might model facts and actions using intermediate nodes.

Employment

Figure 4-2 shows how the fact of Ian being employed by Neo Technology in the role of engineer can be represented in the graph.

In Cypher, this can be expressed as:

CREATE (:Person {name:'Ian'})-[:EMPLOYMENT]->
        (employment:Job {start_date:'2011-01-05'})
        -[:EMPLOYER]->(:Company {name:'Neo'}),
       (employment)-[:ROLE]->(:Role {name:'engineer'})
Figure 4-2. Ian began employment as an engineer at Neo Technology

Performance

Figure 4-3 shows how the fact that William Hartnell played The Doctor in the story The Sensorites can be represented in the graph.

In Cypher:

CREATE (:Actor {name:'William Hartnell'})-[:PERFORMED_IN]->
         (performance:Performance {year:1964})-[:PLAYED]->
         (:Role {name:'The Doctor'}),
       (performance)-[:FOR]->(:Story {title:'The Sensorites'})
Figure 4-3. William Hartnell played The Doctor in the story The Sensorites

Emailing

Figure 4-4 shows the act of Ian emailing Jim and copying in Alistair.

Figure 4-4. Ian emailed Jim, and copied in Alistair

In Cypher, this can be expressed as:

CREATE (:Person {name:'Ian'})-[:SENT]->(e:Email {content:'...'})
         -[:TO]->(:Person {name:'Jim'}),
       (e)-[:CC]->(:Person {name:'Alistair'})

Reviewing

Figure 4-5 shows how the act of Alistair reviewing a film can be represented in the graph.

In Cypher:

CREATE (:Person {name:'Alistair'})-[:WROTE]->
         (review:Review {text:'...'})-[:OF]->(:Film {title:'...'}),
       (review)-[:PUBLISHED_IN]->(:Publication {title:'...'})
Figure 4-5. Alistair wrote a review of a film, which was published in a magazine

Represent Complex Value Types as Nodes

Value types are things that do not have an identity, and whose equivalence is based solely on their values. Examples include money, address, and SKU. Complex value types are value types with more than one field or property. Address, for example, is a complex value type. Such multiproperty value types may be usefully represented as separate nodes:

MATCH (:Order {orderid:13567})-[:DELIVERY_ADDRESS]->(address:Address)
RETURN address.first_line, address.zipcode
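
Creating such a value-type node might look like this sketch (the property values are illustrative):

```cypher
// The address is a value-type node hanging off the order
CREATE (:Order {orderid:13567})-[:DELIVERY_ADDRESS]->
       (:Address {first_line:'108 Main Street', zipcode:'37932'})
```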

Time

Time can be modeled in several different ways in the graph. Here we describe two techniques: timeline trees and linked lists. In some solutions, it’s useful to combine these two techniques.

Timeline trees

If we need to find all the events that have occurred over a specific period, we can build a timeline tree, as shown in Figure 4-6.

Figure 4-6. A timeline tree showing the broadcast dates for four episodes of a TV program

Each year has its own set of month nodes; each month has its own set of day nodes. We need only insert nodes into the timeline tree as and when they are needed. Assuming the root timeline node has been indexed, or can be discovered by traversing the graph, the following Cypher statement ensures that all necessary nodes and relationships for a particular event — year, month, day, plus the node representing the event itself — are either already present in the graph, or, if not present, are added to the graph (MERGE will add any missing elements):

MATCH (timeline:Timeline {name:{timelineName}})
MERGE (episode:Episode {name:{newEpisode}})
MERGE (timeline)-[:YEAR]->(year:Year {value:{year}})
MERGE (year)-[:MONTH]->(month:Month {name:{monthName}})
MERGE (month)-[:DAY]->(day:Day {value:{day}, name:{dayName}})
MERGE (day)<-[:BROADCAST_ON]-(episode)

Querying the calendar for all events between a start date (inclusive) and an end date (exclusive) can be done with the following Cypher code:

MATCH (timeline:Timeline {name:{timelineName}})
MATCH (timeline)-[:YEAR]->(year:Year)-[:MONTH]->(month:Month)-[:DAY]->
      (day:Day)<-[:BROADCAST_ON]-(n)
WHERE ((year.value > {startYear} AND year.value < {endYear})
      OR ({startYear} = {endYear} AND {startMonth} = {endMonth}
          AND year.value = {startYear} AND month.value = {startMonth}
          AND day.value >= {startDay} AND day.value < {endDay})
      OR ({startYear} = {endYear} AND {startMonth} < {endMonth}
          AND year.value = {startYear}
          AND ((month.value = {startMonth} AND day.value >= {startDay})
              OR (month.value > {startMonth} AND month.value < {endMonth})
              OR (month.value = {endMonth} AND day.value < {endDay})))
      OR ({startYear} < {endYear}
          AND year.value = {startYear}
          AND ((month.value > {startMonth})
              OR (month.value = {startMonth} AND day.value >= {startDay})))
      OR ({startYear} < {endYear}
          AND year.value = {endYear}
          AND ((month.value < {endMonth})
              OR (month.value = {endMonth} AND day.value < {endDay}))))
RETURN n

The WHERE clause here, though somewhat verbose, simply filters each match based on the start and end dates supplied to the query.

Linked lists

Many events have temporal relationships to the events that precede and follow them. We can use NEXT and/or PREVIOUS relationships (depending on our preference) to create linked lists that capture this natural ordering, as shown in Figure 4-7.2 Linked lists allow for very rapid traversal of time-ordered events.
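
As a sketch, episodes might be chained and traversed like this (the episode names are illustrative):

```cypher
// Chain episodes in broadcast order
CREATE (:Episode {name:'An Unearthly Child'})-[:NEXT]->
       (:Episode {name:'The Daleks'})-[:NEXT]->
       (:Episode {name:'The Edge of Destruction'})

// Walk forward from a given episode to any depth
MATCH (:Episode {name:'An Unearthly Child'})-[:NEXT*]->(later:Episode)
RETURN later.name
```

Because the list is ordered by construction, this traversal touches only the chain itself, never the rest of the graph.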

Figure 4-7. A doubly linked list representing a time-ordered series of events

Versioning

A versioned graph enables us to recover the state of the graph at a particular point in time. Most graph databases don’t support versioning as a first-class concept. It is possible, however, to create a versioning scheme inside the graph model. With this scheme nodes and relationships are timestamped and archived whenever they are modified.3 The downside of such versioning schemes is that they leak into any queries written against the graph, adding a layer of complexity to even the simplest query.
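
One way such a scheme can be sketched (this is an illustration of the approach, not a technique prescribed by the text) is to separate an entity's stable identity from its timestamped state:

```cypher
// An identity node points to successive state nodes; each STATE
// relationship records the period for which that state was current
// (the 'to' value on the current state is a sentinel "max" timestamp)
CREATE (shop:Shop {shop_id:'s1'}),
       (shop)-[:STATE {from:1388534400000, to:1391212800000}]->
         (:ShopState {name:'General Store'}),
       (shop)-[:STATE {from:1391212800000, to:9223372036854775807}]->
         (:ShopState {name:'Super Store'})
```

Reconstructing the graph as of a given instant then means filtering every traversed STATE relationship on `from <= t < to`, which is exactly the query-complexity cost described above.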

Iterative and Incremental Development

We develop the data model feature by feature, user story by user story. This will ensure that we identify the relationships our application will use to query the graph. A data model that is developed in line with the iterative and incremental delivery of application features will look quite different from one drawn up using a data model-first approach, but it will be the correct model, motivated throughout by the application’s needs, and the questions that arise in conjunction with those needs.

Graph databases provide for the smooth evolution of our data model. Migrations and denormalization are rarely an issue. New facts and new compositions become new nodes and relationships, while optimizing for performance-critical access patterns typically involves introducing a direct relationship between two nodes that would otherwise be connected only by way of intermediaries. Unlike the optimization strategies we employ in the relational world, which typically involve denormalizing and thereby compromising a high-fidelity model, this is not an either/or issue: either the detailed, highly normalized structure, or the high performance compromise. With the graph we retain the original high-fidelity graph structure, while at the same time enriching it with new elements that cater to new needs.

We will quickly see how different relationships can sit side-by-side with one another, catering to different needs without distorting the model in favor of any one particular need. Addresses help illustrate the point here. Imagine, for example, that we are developing a retail application. While developing a fulfillment story, we add the ability to dispatch a parcel to a customer’s delivery address, which we find using the following query:

MATCH (user:User {id:{userId}})
MATCH (user)-[:DELIVERY_ADDRESS]->(address:Address)
RETURN address

Later on, when adding some billing functionality, we introduce a BILLING_ADDRESS relationship. Later still, we add the ability for customers to manage all their addresses. This last feature requires us to find all addresses — whether delivery, billing, or some other address. To facilitate this, we introduce a general ADDRESS relationship:

MATCH (user:User {id:{userId}})
MATCH (user)-[:ADDRESS]->(address:Address)
RETURN address

By this time, our data model looks something like the one shown in Figure 4-8. DELIVERY_ADDRESS specializes the data on behalf of the application’s fulfillment needs; BILLING_ADDRESS specializes the data on behalf of the application’s billing needs; and ADDRESS specializes the data on behalf of the application’s customer management needs.

Figure 4-8. Different relationships for different application needs

Just because we can add new relationships to meet new application goals, doesn’t mean we always have to do this. We’ll invariably identify opportunities for refactoring the model as we go. There’ll be plenty of times, for example, where an existing relationship will suffice for a new query, or where renaming an existing relationship will allow it to be used for two different needs. When these opportunities arise, we should take them. If we’re developing our solution in a test-driven manner — described in more detail later in this chapter — we’ll have a sound suite of regression tests in place. These tests give us the confidence to make substantial changes to the model.

Application Architecture

In planning a graph database-based solution, there are several architectural decisions to be made. These decisions will vary slightly depending on the database product we’ve chosen. In this section, we’ll describe some of the architectural choices, and the corresponding application architectures, available to us when using Neo4j.

Embedded versus Server

Most databases today run as a server that is accessed through a client library. Neo4j is somewhat unusual in that it can be run in embedded as well as server mode — in fact, going back nearly ten years, its origins are as an embedded graph database.


Note

An embedded database is not the same as an in-memory database. An embedded instance of Neo4j still makes all data durable on disk. Later, in “Testing”, we’ll discuss ImpermanentGraphDatabase, which is an in-memory version of Neo4j designed for testing purposes.


Embedded Neo4j

In embedded mode, Neo4j runs in the same process as our application. Embedded Neo4j is ideal for hardware devices, desktop applications, and for incorporating in our own application servers. Some of the advantages of embedded mode include:

Low latency
Because our application speaks directly to the database, there's no network overhead.
Choice of APIs
We have access to the full range of APIs for creating and querying data: the Core API, Traversal Framework, and the Cypher query language.
Explicit transactions
Using the Core API, we can control the transactional life cycle, executing an arbitrarily complex sequence of commands against the database in the context of a single transaction. The Java APIs also expose the transaction life cycle, enabling us to plug in custom transaction event handlers that execute additional logic with each transaction.


When running in embedded mode, however, we should bear in mind the following:

JVM only
Neo4j is a JVM-based database. Many of its APIs are, therefore, accessible only from a JVM-based language.
GC behaviors
When running in embedded mode, Neo4j is subject to the garbage collection (GC) behaviors of the host application. Long GC pauses can affect query times. Further, when running an embedded instance as part of an HA (high-availability) cluster, long GC pauses can cause the cluster protocol to trigger a master reelection.
Database life cycle
The application is responsible for controlling the database life cycle, which includes starting and closing it safely.


Embedded Neo4j can be clustered for high availability and horizontal read scaling just as the server version. In fact, we can run a mixed cluster of embedded and server instances (clustering is performed at the database level, rather than the server level). This is common in enterprise integration scenarios, where regular updates from other systems are executed against an embedded instance, and then replicated out to server instances.


Server mode


Running Neo4j in server mode is the most common means of deploying the database today. At the heart of each server is an embedded instance of Neo4j. Some of the benefits of server mode include:

REST API
The server exposes a rich REST API that allows clients to send JSON-formatted requests over HTTP. Responses comprise JSON-formatted documents enriched with hypermedia links that advertise additional features of the dataset. The REST API is extensible by end users and supports the execution of Cypher queries.
Platform independence
Because access is by way of JSON-formatted documents sent over HTTP, a Neo4j server can be accessed by a client running on practically any platform. All that's needed is an HTTP client library.4
Scaling independence
With Neo4j running in server mode, we can scale our database cluster independently of our application server cluster.
Isolation from application GC behaviors
In server mode, Neo4j is protected from any untoward GC behaviors triggered by the rest of the application. Of course, Neo4j still produces some garbage, but its impact on the garbage collector has been carefully monitored and tuned during development to mitigate any significant side effects. However, because server extensions enable us to run arbitrary Java code inside the server (see “Server extensions”), the use of server extensions may impact the server’s GC behavior.


When using Neo4j in server mode, we should bear in mind the following:

Network overhead
There is some communication overhead to each HTTP request, though it's fairly minimal. After the first client request, the TCP connection remains open until closed by the client.
Transaction state
Neo4j server has a transactional Cypher endpoint. This allows the client to execute a series of Cypher statements in the context of a single transaction. With each request, the client extends its lease on the transaction. If the client fails to complete or roll back the transaction for any reason, this transactional state will remain on the server until it times out (by default, the server will reclaim orphaned transactions after 60 seconds). For more complex, multistep operations requiring a single transactional context, we should consider using a server extension (see "Server extensions").
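The lease-based behavior described above is easiest to see in the shape of the requests themselves. As an illustrative sketch — the URI and payload shape follow the Neo4j 2.x transactional endpoint, while the query and parameter are just examples:

```
POST http://localhost:7474/db/data/transaction/commit
Content-Type: application/json

{
  "statements": [
    {
      "statement": "MATCH (u:User {name:{name}}) RETURN u.name",
      "parameters": { "name": "Ben" }
    }
  ]
}
```

POSTing to /db/data/transaction (without the trailing /commit) instead opens a long-running transaction; each subsequent request extends its lease until the client commits or rolls back, or the server reclaims it after the timeout.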


Access to Neo4j server is typically by way of its REST API, as discussed previously. The REST API comprises JSON-formatted documents over HTTP. Using the REST API we can submit Cypher queries, configure named indexes, and execute several of the built-in graph algorithms. We can also submit JSON-formatted traversal descriptions, and perform batch operations. For the majority of use cases the REST API is sufficient; however, if we need to do something we cannot currently accomplish using the REST API, we should consider developing a server extension.


Server extensions


Server extensions enable us to run Java code inside the server. Using server extensions, we can extend the REST API, or replace it entirely.


Extensions take the form of JAX-RS annotated classes. JAX-RS is a Java API for building RESTful resources. Using JAX-RS annotations, we decorate each extension class to indicate to the server which HTTP requests it handles. Additional annotations control request and response formats, HTTP headers, and the formatting of URI templates.


Here’s an implementation of a simple server extension that allows a client to request the distance between two members of a social network:

@Path("/distance")
public class SocialNetworkExtension
{
    private final GraphDatabaseService db;

    public SocialNetworkExtension( @Context GraphDatabaseService db )
    {
        this.db = db;
    }

    @GET
    @Produces("text/plain")
    @Path("/{name1}/{name2}")
    public String getDistance  ( @PathParam("name1") String name1,
                                 @PathParam("name2") String name2 )
    {
        String query = "MATCH (first:User {name:{name1}}),\n" +
                "(second:User {name:{name2}})\n" +
                "MATCH p=shortestPath(first-[*..4]-second)\n" +
                "RETURN length(p) AS depth";

        Map<String, Object> params = new HashMap<String, Object>();
        params.put( "name1", name1 );
        params.put( "name2", name2 );

        Result result = db.execute( query, params );

        return String.valueOf( result.columnAs( "depth" ).next() );
    }
}


Of particular interest here are the various annotations:

  • @Path("/distance") specifies that this extension will respond to requests directed to relative URIs beginning /distance.
  • The @Path("/{name1}/{name2}") annotation on getDistance() further qualifies the URI template associated with this extension. The fragment here is concatenated with /distance to produce /distance/{name1}/{name2}, where {name1} and {name2} are placeholders for any characters occurring between the forward slashes. Later on, in “Testing server extensions”, we’ll register this extension under the /socnet relative URI. At that point, these several different parts of the path ensure that HTTP requests directed to a relative URI beginning /socnet/distance/{name1}/{name2} (for example, http://localhost/socnet/distance/Ben/Mike) will be dispatched to an instance of this extension.
  • @GET specifies that getDistance() should be invoked only if the request is an HTTP GET. @Produces indicates that the response entity body will be formatted as text/plain.
  • The two @PathParam annotations prefacing the parameters to getDistance() serve to map the contents of the {name1} and {name2} path placeholders to the method’s name1 and name2 parameters. Given the URI http://localhost/socnet/distance/Ben/Mike, getDistance() will be invoked with Ben for name1 and Mike for name2.
  • The @Context annotation in the constructor causes this extension to be handed a reference to the embedded graph database inside the server. The server infrastructure takes care of creating an extension and injecting it with a graph database instance, but the very presence of the GraphDatabaseService parameter here makes this extension exceedingly testable. As we’ll see later, in “Testing server extensions”, we can unit test extensions without having to run them inside a server.


Server extensions can be powerful elements in our application architecture. Their chief benefits include:

Complex transactions
Extensions enable us to execute an arbitrarily complex sequence of operations in the context of a single transaction.
Choice of APIs
Each extension is injected with a reference to the embedded graph database at the heart of the server. This gives us access to the full range of APIs — Core API, Traversal Framework, graph algorithm package, and Cypher — for developing our extension's behavior.
Encapsulation
Because each extension is hidden behind a RESTful interface, we can improve and modify its implementation over time.
Response formats
We control the response — both the representation format and the HTTP headers. This enables us to create response messages whose contents employ terminology from our domain, rather than the graph-based terminology of the standard REST API (users, products, and orders, for example, rather than nodes, relationships, and properties). Further, in controlling the HTTP headers attached to the response, we can leverage the HTTP protocol for things such as caching and conditional requests.


When considering using server extensions, we should bear in mind the following points:

JVM only
As with developing against embedded Neo4j, we'll have to use a JVM-based language.
GC behaviors
We can do arbitrarily complex (and dangerous) things inside a server extension. We need to monitor garbage collection behaviors to ensure that we don't introduce any untoward side effects.


Clustering


As we discuss in more detail in “Availability”, Neo4j clusters for high availability and horizontal read scaling using master-slave replication. In this section we discuss some of the strategies to consider when using clustered Neo4j.


Replication


Although all writes to a cluster are coordinated through the master, Neo4j does allow writing through slaves, but even then, the slave that’s being written to syncs with the master before returning to the client. Because of the additional network traffic and coordination protocol, writing through slaves can be an order of magnitude slower than writing directly to the master. The only reasons for writing through slaves are to increase the durability guarantees of each write (the write is made durable on two instances, rather than one) and to ensure that we can read our own writes when employing cache sharding (see “Cache sharding” and “Read your own writes” later in this chapter). Because newer versions of Neo4j enable us to specify that writes to the master be replicated out to one or more slaves, thereby increasing the durability guarantees of writes to the master, the case for writing through slaves is now less compelling. Today it is recommended that all writes be directed to the master, and then replicated to slaves using the ha.tx_push_factor and ha.tx_push_strategy configuration settings.
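These two settings belong in the database configuration — neo4j.properties in Neo4j 2.x. A minimal sketch, with example values rather than recommendations:

```
# Push each committed transaction to this many slaves before the
# commit returns, increasing the durability of writes to the master.
ha.tx_push_factor=2

# Strategy for choosing which slaves receive the push:
# "fixed" or "round_robin".
ha.tx_push_strategy=fixed
```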


Buffer writes using queues


In high write load scenarios, we can use queues to buffer writes and regulate load. With this strategy, writes to the cluster are buffered in a queue. A worker then polls the queue and executes batches of writes against the database. Not only does this regulate write traffic, but it reduces contention and enables us to pause write operations without refusing client requests during maintenance periods.
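This buffering pattern can be sketched in plain Java. The queued items, batch size, and class names here are illustrative; a real worker would run nextBatch() in a loop and execute each drained batch against the database in a single transaction:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class WriteBuffer
{
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Client-facing call: enqueue the write and return immediately,
    // rather than applying it against the database synchronously.
    public void submit( String cypherStatement )
    {
        queue.add( cypherStatement );
    }

    // Worker-facing call: drain up to batchSize buffered writes.
    // Pausing the worker pauses writes without refusing client requests.
    public List<String> nextBatch( int batchSize )
    {
        List<String> batch = new ArrayList<>();
        queue.drainTo( batch, batchSize );
        return batch;
    }

    public static void main( String[] args )
    {
        WriteBuffer buffer = new WriteBuffer();
        for ( int i = 0; i < 5; i++ )
        {
            buffer.submit( "CREATE (:User {id:" + i + "})" );
        }
        // Writes come back in arrival order, batchSize at a time.
        System.out.println( buffer.nextBatch( 3 ).size() );  // 3
        System.out.println( buffer.nextBatch( 3 ).size() );  // 2
    }
}
```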


Global clusters


For applications catering to a global audience, it is possible to install a multiregion cluster in multiple data centers and on cloud platforms such as Amazon Web Services (AWS). A multiregion cluster enables us to service reads from the portion of the cluster geographically closest to the client. In these situations, however, the latency introduced by the physical separation of the regions can sometimes disrupt the coordination protocol. It is, therefore, often desirable to restrict master reelection to a single region. To achieve this, we create slave-only databases for the instances we don’t want to participate in master reelection. We do this by including the ha.slave_coordinator_update_mode=none configuration parameter in an instance’s configuration.
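In configuration terms this is a per-instance setting, shown here as a neo4j.properties fragment (the server id is illustrative):

```
# This instance replicates data and serves reads, but never
# participates in master reelection.
ha.server_id=3
ha.slave_coordinator_update_mode=none
```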


Load Balancing


When using a clustered graph database, we should consider load balancing traffic across the cluster to help maximize throughput and reduce latency. Neo4j doesn’t include a native load balancer, relying instead on the load-balancing capabilities of the network infrastructure.


Separate read traffic from write traffic


Given the recommendation to direct the majority of write traffic to the master, we should consider clearly separating read requests from write requests. We should configure our load balancer to direct write traffic to the master, while balancing the read traffic across the entire cluster.


In a web-based application, the HTTP method is often sufficient to distinguish a request with a significant side effect — a write — from one that has no significant side effect on the server: POST, PUT, and DELETE can modify server-side resources, whereas GET is side-effect free.
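Where the HTTP method is a reliable read/write signal, the routing decision reduces to a one-liner. A minimal sketch, with placeholder load-balancer addresses:

```java
public class RequestRouter
{
    // Placeholder addresses for the two load balancers.
    static final String WRITE_BALANCER = "http://write-lb.example.com";
    static final String READ_BALANCER  = "http://read-lb.example.com";

    // GET is side-effect free, so it can be served by any instance;
    // POST, PUT, and DELETE may modify state and go to the master.
    public static String balancerFor( String httpMethod )
    {
        return "GET".equals( httpMethod ) ? READ_BALANCER : WRITE_BALANCER;
    }

    public static void main( String[] args )
    {
        System.out.println( balancerFor( "GET" ) );
        System.out.println( balancerFor( "POST" ) );
    }
}
```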


When using server extensions, it’s important to distinguish read and write operations using @GET and @POST annotations. If our application depends solely on server extensions, this will suffice to separate the two. If we’re using the REST API to submit Cypher queries to the database, however, the situation is not so straightforward. The REST API uses POST as a general “process this” semantic for both read and write Cypher requests. To separate read and write requests in this scenario, we introduce a pair of load balancers: a write load balancer that always directs requests to the master, and a read load balancer that balances requests across the entire cluster. In our application logic, where we know whether the operation is a read or a write, we will then have to decide which of the two addresses we should use for any particular request, as illustrated in Figure 4-9.


When running in server mode, Neo4j exposes a URI that indicates whether that instance is currently the master, and if it isn’t, which of the instances is the master. Load balancers can poll this URI at intervals to determine where to route traffic.


Cache sharding


Queries run fastest when the portions of the graph needed to satisfy them reside in main memory. When a graph holds many billions of nodes, relationships, and properties, not all of it will fit into main memory. Other data technologies often solve this problem by partitioning their data, but with graphs, partitioning or sharding is unusually difficult (see “The Holy Grail of Graph Scalability”). How, then, can we provide for high-performance queries over a very large graph?


One solution is to use a technique called cache sharding (Figure 4-10), which consists of routing each request to a database instance in an HA cluster where the portion of the graph necessary to satisfy that request is likely already in main memory (remember: every instance in the cluster will contain a full copy of the data). If the majority of an application’s queries are graph-local queries, meaning they start from one or more specific points in the graph and traverse the surrounding subgraphs, then a mechanism that consistently routes queries beginning from the same set of start points to the same database instance will increase the likelihood of each query hitting a warm cache.

Figure 4-9. Using read/write load balancers to direct requests to a cluster


The strategy used to implement consistent routing will vary by domain. Sometimes it’s good enough to have sticky sessions; other times we’ll want to route based on the characteristics of the dataset. The simplest strategy is to have the instance that first serves requests for a particular user thereafter serve subsequent requests for that user. Other domain-specific approaches will also work. For example, in a geographical data system we can route requests about particular locations to specific database instances that have been warmed for that location. Both strategies increase the likelihood of the required nodes and relationships already being cached in main memory, where they can be quickly accessed and processed.
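The simplest of these strategies can be approximated by hashing a routing key, such as the user's name, over a fixed list of instances. A minimal sketch (the instance addresses are placeholders):

```java
import java.util.List;

public class ConsistentRouter
{
    private final List<String> instances;

    public ConsistentRouter( List<String> instances )
    {
        this.instances = instances;
    }

    // Hash the routing key (for example, a user name) onto an instance.
    // The same key always lands on the same instance, so repeated
    // queries for that user are likely to hit a warm cache.
    public String instanceFor( String routingKey )
    {
        int index = Math.floorMod( routingKey.hashCode(), instances.size() );
        return instances.get( index );
    }

    public static void main( String[] args )
    {
        ConsistentRouter router = new ConsistentRouter( java.util.Arrays.asList(
                "neo4j-1:7474", "neo4j-2:7474", "neo4j-3:7474" ) );
        // Deterministic: the same user is always routed identically.
        System.out.println(
            router.instanceFor( "Ben" ).equals( router.instanceFor( "Ben" ) ) );
    }
}
```

A production router would also need to handle instance failure and rebalancing, which a bare modulo hash does not.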

Figure 4-10. Cache sharding


Read your own writes


Occasionally we may need to read our own writes — typically when the application applies an end-user change, and needs on the next request to reflect the effect of this change back to the user. Whereas writes to the master are immediately consistent, the cluster as a whole is eventually consistent. How can we ensure that a write directed to the master is reflected in the next load-balanced read request? One solution is to use the same consistent routing technique used in cache sharding to direct the write to the slave that will be used to service the subsequent read. This assumes that the write and the read can be consistently routed based on some domain criteria in each request.


This is one of the few occasions where it makes sense to write through a slave. But remember: writing through a slave can be an order of magnitude slower than writing directly to the master. We should use this technique sparingly. If a high proportion of our writes require us to read our own write, this technique will significantly impact throughput and latency.


Testing


Testing is a fundamental part of the application development process — not only as a means of verifying that a query or application feature behaves correctly, but also as a way of designing and documenting our application and its data model. Throughout this section we emphasize that testing is an everyday activity; by developing our graph database solution in a test-driven manner, we provide for the rapid evolution of our system, and its continued responsiveness to new business needs.


Test-Driven Data Model Development


In discussing data modeling, we’ve stressed that our graph model should reflect the kinds of queries we want to run against it. By developing our data model in a test-driven fashion we document our understanding of our domain, and validate that our queries behave correctly.


With test-driven data modeling, we write unit tests based on small, representative example graphs drawn from our domain. These example graphs contain just enough data to communicate a particular feature of the domain. In many cases, they might only comprise 10 or so nodes, plus the relationships that connect them. We use these examples to describe what is normal for the domain, and also what is exceptional. As we discover anomalies and corner cases in our real data, we write a test that reproduces what we’ve discovered.


The example graphs we create for each test comprise the setup or context for that test. Within this context we exercise a query, and assert that the query behaves as expected. Because we control the contents of the test data, we, as the author of the test, know what results to expect.


Tests can act like documentation. By reading the tests, developers gain an understanding of the problems and needs the application is intended to address, and the ways in which the authors have gone about addressing them. With this in mind, it’s best to use each test to test just one aspect of our domain. It’s far easier to read lots of small tests, each of which communicates a discrete feature of our data in a clear, simple, and concise fashion, than it is to reverse engineer a complex domain from a single large and unwieldy test. In many cases, we’ll find a particular query being exercised by several tests, some of which demonstrate the happy path through our domain, others of which exercise it in the context of some exceptional structure or set of values.5


Over time, we’ll build up a suite of tests that can act as a powerful regression test mechanism. As our application evolves, and we add new sources of data, or change the model to meet new needs, our regression test suite will continue to assert that existing features still behave as they should. Evolutionary architectures, and the incremental and iterative software development techniques that support them, depend upon a bedrock of asserted behavior. The unit-testing approach to data model development described here enables developers to respond to new business needs with very little risk of undermining or breaking what has come before, confident in the continued quality of the solution.


Example: A test-driven social network data model


In this example we’re going to demonstrate developing a very simple Cypher query for a social network. Given the names of a couple of members of the network, our query determines the distance between them.


First, we create a small graph that is representative of our domain. Using Cypher, we create a network comprising 10 nodes and 8 relationships:

public GraphDatabaseService createDatabase()
{
    // Create nodes
    String createGraph = "CREATE\n" +
        "(ben:User {name:'Ben'}),\n" +
        "(arnold:User {name:'Arnold'}),\n" +
        "(charlie:User {name:'Charlie'}),\n" +
        "(gordon:User {name:'Gordon'}),\n" +
        "(lucy:User {name:'Lucy'}),\n" +
        "(emily:User {name:'Emily'}),\n" +
        "(sarah:User {name:'Sarah'}),\n" +
        "(kate:User {name:'Kate'}),\n" +
        "(mike:User {name:'Mike'}),\n" +
        "(paula:User {name:'Paula'}),\n" +
        "(ben)-[:FRIEND]->(charlie),\n" +
        "(charlie)-[:FRIEND]->(lucy),\n" +
        "(lucy)-[:FRIEND]->(sarah),\n" +
        "(sarah)-[:FRIEND]->(mike),\n" +
        "(arnold)-[:FRIEND]->(gordon),\n" +
        "(gordon)-[:FRIEND]->(emily),\n" +
        "(emily)-[:FRIEND]->(kate),\n" +
        "(kate)-[:FRIEND]->(paula)";

    String createIndex = "CREATE INDEX ON :User(name)";

    GraphDatabaseService db =
        new TestGraphDatabaseFactory().newImpermanentDatabase();

    db.execute( createGraph );
    db.execute( createIndex );

    return db;
}


There are two things of interest in createDatabase(). The first is the use of ImpermanentGraphDatabase, which is a lightweight, in-memory version of Neo4j, designed specifically for unit testing. By using ImpermanentGraphDatabase, we avoid having to clear up store files on disk after each test. The class can be found in the Neo4j kernel test jar, which can be obtained with the following dependency reference:

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-kernel</artifactId>
    <version>${project.version}</version>
    <type>test-jar</type>
    <scope>test</scope>
</dependency>

Warning


ImpermanentGraphDatabase is intended for use in unit-tests only. It is an in-memory only version of Neo4j, not intended for production use.



The second thing of interest in createDatabase() is the Cypher command to index nodes with a given label on a given property. In this case we’re saying that we want to index nodes with a :User label based on the value of their name property.


Having created a sample graph, we can now write our first test. Here’s the test fixture for testing our social network data model and its queries:

public class SocialNetworkTest
{
    private static GraphDatabaseService db;
    private static SocialNetworkQueries queries;

    @BeforeClass
    public static void init()
    {
        db = createDatabase();
        queries = new SocialNetworkQueries( db );
    }

    @AfterClass
    public static void shutdown()
    {
        db.shutdown();
    }

    @Test
    public void shouldReturnShortestPathBetweenTwoFriends() throws Exception
    {
        // when
        Result result = queries.distance( "Ben", "Mike" );

        // then
        assertTrue( result.hasNext() );
        assertEquals( 4, result.next().get( "distance" ) );
    }

    // more tests
}

This test fixture includes an initialization method, annotated with @BeforeClass, which executes before any tests start. Here we call createDatabase() to create an instance of the sample graph, and an instance of SocialNetworkQueries, which houses the queries under development.

Our first test, shouldReturnShortestPathBetweenTwoFriends(), tests that the query under development can find a path between any two members of the network — in this case, Ben and Mike. Given the contents of the sample graph, we know that Ben and Mike are connected, but only remotely, at a distance of 4. The test, therefore, asserts that the query returns a nonempty result containing a distance value of 4.

Having written the test, we now start developing our first query. Here’s the implementation of SocialNetworkQueries:

public class SocialNetworkQueries
{
    private final GraphDatabaseService db;

    public SocialNetworkQueries( GraphDatabaseService db )
    {
        this.db = db;
    }

    public Result distance( String firstUser, String secondUser )
    {
        String query = "MATCH (first:User {name:{firstUser}}),\n" +
            "(second:User {name:{secondUser}})\n" +
            "MATCH p=shortestPath((first)-[*..4]-(second))\n" +
            "RETURN length(p) AS distance";

        Map<String, Object> params = new HashMap<String, Object>();
        params.put( "firstUser", firstUser );
        params.put( "secondUser",  secondUser );

        return db.execute( query, params );
    }

    // More queries
}

In the constructor for SocialNetworkQueries we store the supplied database instance in a member variable, which allows it to be reused over and again throughout the lifetime of the queries instance. The query itself we implement in the distance() method. Here we create a Cypher statement, initialize a map containing the query parameters, and execute the statement.

If shouldReturnShortestPathBetweenTwoFriends() passes (it does), we then go on to test additional scenarios. What happens, for example, if two members of the network are separated by more than four connections? We write up the scenario and what we expect the query to do in another test:

@Test
public void shouldReturnNoResultsWhenNoPathAtDistance4OrLess()
    throws Exception
{
    // when
    Result result = queries.distance( "Ben", "Arnold" );

    // then
    assertFalse( result.hasNext() );
}

In this instance, this second test passes without us having to modify the underlying Cypher query. In many cases, however, a new test will force us to modify a query’s implementation. When that happens, we modify the query to make the new test pass, and then run all the tests in the fixture. A failing test anywhere in the fixture indicates we’ve broken some existing functionality. We continue to modify the query until all tests are green once again.

Testing server extensions

Server extensions can be developed in a test-driven manner just as easily as embedded Neo4j. Using the simple server extension described earlier, here’s how we test it:

@Test
public void extensionShouldReturnDistance() throws Exception
{
    // given
    SocialNetworkExtension extension = new SocialNetworkExtension( db );

    // when
    String distance = extension.getDistance( "Ben", "Mike" );

    // then
    assertEquals( "4", distance );
}

Because the extension’s constructor accepts a GraphDatabaseService instance, we can inject a test instance (an ImpermanentGraphDatabase instance), and then call its methods as per any other object.

If, however, we wanted to test the extension running inside a server, we have a little more setup to do:

public class SocialNetworkExtensionTest
{
    private static ServerControls server;

    @BeforeClass
    public static void init() throws IOException
    {
        // Create nodes
        String createGraph = "CREATE\n" +
            "(ben:User {name:'Ben'}),\n" +
            "(arnold:User {name:'Arnold'}),\n" +
            "(charlie:User {name:'Charlie'}),\n" +
            "(gordon:User {name:'Gordon'}),\n" +
            "(lucy:User {name:'Lucy'}),\n" +
            "(emily:User {name:'Emily'}),\n" +
            "(sarah:User {name:'Sarah'}),\n" +
            "(kate:User {name:'Kate'}),\n" +
            "(mike:User {name:'Mike'}),\n" +
            "(paula:User {name:'Paula'}),\n" +
            "(ben)-[:FRIEND]->(charlie),\n" +
            "(charlie)-[:FRIEND]->(lucy),\n" +
            "(lucy)-[:FRIEND]->(sarah),\n" +
            "(sarah)-[:FRIEND]->(mike),\n" +
            "(arnold)-[:FRIEND]->(gordon),\n" +
            "(gordon)-[:FRIEND]->(emily),\n" +
            "(emily)-[:FRIEND]->(kate),\n" +
            "(kate)-[:FRIEND]->(paula)";

        server = TestServerBuilders
            .newInProcessBuilder()
            .withExtension(
                "/socnet",
                ColleagueFinderExtension.class )
            .withFixture( createGraph )
            .newServer();
    }

    @AfterClass
    public static void teardown()
    {
        server.close();
    }

    @Test
    public void serverShouldReturnDistance() throws Exception
    {
        HTTP.Response response = HTTP.GET( server.httpURI()
            .resolve( "/socnet/distance/Ben/Mike" ).toString() );

        assertEquals( 200, response.status() );
        assertEquals( "text/plain", response.header( "Content-Type" ));
        assertEquals( "4", response.rawContent() );
    }
}

Here we’re using an instance of ServerControls to host the extension. We create the server and populate its database in the test fixture’s init() method using the builder supplied by TestServerBuilders. This builder enables us to register the extension, and associate it with a relative URI space (in this case, everything below /socnet). Once init() has completed, we have a database server instance up and running.

In the test itself, serverShouldReturnDistance(), we access this server using an HTTP client from the Neo4j test library. The client issues an HTTP GET request for the resource at /socnet/distance/Ben/Mike. (At the server end, this request is dispatched to an instance of SocialNetworkExtension.) When the client receives a response, the test asserts that the HTTP status code, content-type, and content of the response body are correct.

Performance Testing

The test-driven approach that we’ve described so far communicates context and domain understanding, and tests for correctness. It does not, however, test for performance. What works fast against a small, 20-node sample graph may not work so well when confronted with a much larger graph. Therefore, to accompany our unit tests, we should consider writing a suite of query performance tests. On top of that, we should also invest in some thorough application performance testing early in our application’s development life cycle.

Query performance tests

Query performance tests are not the same as full-blown application performance tests. All we’re interested in at this stage is whether a particular query performs well when run against a graph that is roughly as big as the kind of graph we expect to encounter in production. Ideally, these tests are developed side-by-side with our unit tests. There’s nothing worse than investing a lot of time in perfecting a query, only to discover it is not fit for production-sized data.

When creating query performance tests, bear in mind the following guidelines:

  • Create a suite of performance tests that exercise the queries developed through our unit testing. Record the performance figures so that we can see the relative effects of tweaking a query, modifying the heap size, or upgrading from one version of a graph database to another.
  • Run these tests often, so that we quickly become aware of any deterioration in performance. We might consider incorporating these tests into a continuous delivery build pipeline, failing the build if the test results exceed a certain value.
  • Run these tests in-process on a single thread. There’s no need to simulate multiple clients at this stage: if the performance is poor for a single client, it’s unlikely to improve for multiple clients. Even though they are not, strictly speaking, unit tests, we can drive them using the same unit testing framework we use to develop our unit tests.
  • Run each query many times, picking starting nodes at random each time, so that we can see the effect of starting from a cold cache, which is then gradually warmed as multiple queries execute.

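Those guidelines can be sketched as a small harness. The sketch below is deliberately Neo4j-free: `runQueryMillis` is a hypothetical callback standing in for whatever code executes the Cypher query and returns its elapsed time in milliseconds, so the harness can focus on the repetition and on random start-node selection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToLongFunction;

public class QueryPerfHarness
{
    // Runs the supplied query once per iteration, picking a random start
    // node each time, and returns the recorded latencies in milliseconds.
    public static List<Long> run( ToLongFunction<String> runQueryMillis,
                                  List<String> startNodes,
                                  int iterations,
                                  long seed )
    {
        Random random = new Random( seed );
        List<Long> latencies = new ArrayList<>();
        for ( int i = 0; i < iterations; i++ )
        {
            String start = startNodes.get( random.nextInt( startNodes.size() ) );
            latencies.add( runQueryMillis.applyAsLong( start ) );
        }
        return latencies;
    }

    public static void main( String[] args )
    {
        // Fake query for illustration: pretend the first (cold-cache)
        // executions are slower than the warmed ones.
        final int[] calls = { 0 };
        ToLongFunction<String> fakeQuery = start -> ++calls[0] <= 10 ? 50L : 5L;

        List<Long> latencies = run( fakeQuery,
            List.of( "Ben", "Mike", "Sarah" ), 100, 42L );
        System.out.println( latencies.size() );   // 100
    }
}
```

Recording the returned latencies per build, as the guidelines suggest, is what makes regressions visible over time.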
Application performance tests

Application performance tests, as distinct from query performance tests, test the performance of the entire application under representative production usage scenarios.

As with query performance tests, we recommend that this kind of performance testing be done as part of everyday development, side-by-side with the development of application features, rather than as a distinct project phase.6 To facilitate application performance testing early in the project life cycle, it is often necessary to develop a “walking skeleton,” an end-to-end slice through the entire system, which can be accessed and exercised by performance test clients. By developing a walking skeleton, we not only provide for performance testing, but we also establish the architectural context for the graph database part of our solution. This enables us to verify our application architecture, and identify layers and abstractions that allow for discrete testing of individual components.

Performance tests serve two purposes: they demonstrate how the system will perform when used in production, and they drive out the operational affordances that make it easier to diagnose performance issues, incorrect behavior, and bugs. What we learn in creating and maintaining a performance test environment will prove invaluable when it comes to deploying and operating the system for real.

When drawing up the criteria for a performance test, we recommend specifying percentiles rather than averages. Never assume a normal distribution of response times: the real world doesn’t work like that. For some applications we may want to ensure that all requests return within a certain time period. In rare circumstances it may be important for the very first request to be as quick as when the caches have been warmed. But in the majority of cases, we will want to ensure that the majority of requests return within a certain time period; that, say, 98% of requests are satisfied in under 200 ms. It is important to keep a record of subsequent test runs so that we can compare performance figures over time, and thereby quickly identify slowdowns and anomalous behavior.

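A percentile target such as "98% of requests under 200 ms" is straightforward to compute from recorded latencies. Here is a minimal sketch using the nearest-rank method (one of several percentile definitions; load-testing tools may use interpolation instead). Note how a single slow outlier dominates the 98th percentile while leaving the median untouched.

```java
import java.util.Arrays;

public class Percentiles
{
    // Nearest-rank percentile: the smallest recorded value such that
    // at least p percent of the observations are less than or equal to it.
    public static long percentile( long[] latenciesMillis, double p )
    {
        long[] sorted = latenciesMillis.clone();
        Arrays.sort( sorted );
        int rank = (int) Math.ceil( p / 100.0 * sorted.length );
        return sorted[Math.max( rank, 1 ) - 1];
    }

    public static void main( String[] args )
    {
        long[] latencies = { 120, 150, 130, 180, 950, 140, 160, 170, 110, 190 };
        System.out.println( percentile( latencies, 98 ) );  // 950
        System.out.println( percentile( latencies, 50 ) );  // 150
    }
}
```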
As with unit tests and query performance tests, application performance tests prove most valuable when employed in an automated delivery pipeline, where successive builds of the application are automatically deployed to a testing environment, the tests executed, and the results automatically analyzed. Log files and test results should be stored for later retrieval, analysis, and comparison. Regressions and failures should fail the build, prompting developers to address the issues in a timely manner. One of the big advantages of conducting performance testing over the course of an application’s development life cycle, rather than at the end, is that failures and regressions can very often be tied back to a recent piece of development. This enables us to diagnose, pinpoint, and remedy issues rapidly and succinctly.

For generating load, we’ll need a load-generating agent. For web applications, there are several open source stress and load testing tools available, including Grinder, JMeter, and Gatling.7 When testing load-balanced web applications, we should ensure that our test clients are distributed across different IP addresses so that requests are balanced across the cluster.

Testing with representative data

For both query performance testing and application performance testing we will need a dataset that is representative of the data we will encounter in production. It will be necessary, therefore, to either create or source such a dataset. In some cases we can obtain a dataset from a third party, or adapt an existing dataset that we own; either way, unless the data is already in the form of a graph, we will have to write some custom export-import code.

In many cases, however, we’re starting from scratch. If this is the case, we must dedicate some time to creating a dataset builder. As with the rest of the software development life cycle, this is best done in an iterative and incremental fashion. Whenever we introduce a new element into our domain’s data model, as documented and tested in our unit tests, we add the corresponding element to our performance dataset builder. That way, our performance tests will come as close to real-world usage as our current understanding of the domain allows.

When creating a representative dataset, we try to reproduce any domain invariants we have identified: the minimum, maximum, and average number of relationships per node, the spread of different relationship types, property value ranges, and so on. Of course, it’s not always possible to know these things upfront, and often we’ll find ourselves working with rough estimates until such point as production data is available to verify our assumptions.

Although ideally we would always test with a production-sized dataset, it is often not possible or desirable to reproduce extremely large volumes of data in a test environment. In such cases, we should at least ensure that we build a representative dataset whose size exceeds our capacity to hold the entire graph in main memory. That way, we’ll be able to observe the effect of cache evictions, and query for portions of the graph not currently held in main memory.

Representative datasets also help with capacity planning. Whether we create a full-sized dataset, or a scaled-down sample of what we expect the production graph to be, our representative dataset will give us some useful figures for estimating the size of the production data on disk. These figures then help us plan how much memory to allocate to the page caches and the Java virtual machine (JVM) heap (see “Capacity Planning” for more details).

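The extrapolation described above is simple arithmetic: measure the on-disk size of the sample store, scale by the ratio of projected to sample entities, and allow some headroom for growth. A hedged sketch follows; the 50% headroom factor is an arbitrary assumption for illustration, not a recommendation.

```java
public class CapacityEstimate
{
    // Scales the measured store size of a representative sample graph up
    // to the projected production graph, plus headroom for growth.
    public static long projectedStoreBytes( long sampleStoreBytes,
                                            long sampleEntities,
                                            long projectedEntities,
                                            double headroomFactor )
    {
        double scale = (double) projectedEntities / sampleEntities;
        return (long) ( sampleStoreBytes * scale * headroomFactor );
    }

    public static void main( String[] args )
    {
        // A 2 GB sample store holding 1M users, projected to 10M users,
        // with 50% headroom.
        long bytes = projectedStoreBytes( 2_000_000_000L, 1_000_000L,
                                          10_000_000L, 1.5 );
        System.out.println( bytes );  // 30000000000
    }
}
```

The resulting figure feeds directly into the page-cache and heap sizing decisions discussed under "Capacity Planning".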
In the following example, we’re using a dataset builder called Neode to build a sample social network:

private void createSampleDataset( GraphDatabaseService db )
{
    DatasetManager dsm = new DatasetManager( db, SysOutLog.INSTANCE );

    // User node specification
    NodeSpecification userSpec =
        dsm.nodeSpecification( "User",
            indexableProperty( db, "User", "name" ) );

    // FRIEND relationship specification
    RelationshipSpecification friend =
        dsm.relationshipSpecification( "FRIEND" );

    Dataset dataset =
        dsm.newDataset( "Social network example" );

    // Create user nodes
    NodeCollection users =
        userSpec.create( 1_000_000 ).update( dataset );


    // Relate users to each other
    users.createRelationshipsTo(
        getExisting( users )
            .numberOfTargetNodes( minMax( 50, 100 ) )
            .relationship( friend )
            .relationshipConstraints( RelationshipUniqueness.BOTH_DIRECTIONS ) )
        .updateNoReturn( dataset );

    dataset.end();
}

Neode uses node and relationship specifications to describe the nodes and relationships in the graph, together with their properties and permitted property values. Neode then provides a fluent interface for creating and relating nodes.

Capacity Planning

At some point in our application’s development life cycle we’ll want to start planning for production deployment. In many cases, an organization’s project management gating processes mean a project cannot get underway without some understanding of the production needs of the application. Capacity planning is essential both for budgeting purposes and for ensuring there is sufficient lead time for procuring hardware and reserving production resources.

In this section we describe some of the techniques we can use for hardware sizing and capacity planning. Our ability to estimate our production needs depends on a number of factors. The more data we have regarding representative graph sizes, query performance, and the number of expected users and their behaviors, the better our ability to estimate our hardware needs. We can gain much of this information by applying the techniques described in “Testing” early in our application development life cycle. In addition, we should understand the cost/performance trade-offs available to us in the context of our business needs.

Optimization Criteria

As we plan our production environment we will be faced with a number of optimization choices. Which we favor will depend upon our business needs:

Cost
We can optimize for cost by installing the minimum hardware necessary to get the job done.
Performance
We can optimize for performance by procuring the fastest solution (subject to budgetary constraints).
Redundancy
We can optimize for redundancy and availability by ensuring the database cluster is big enough to survive a certain number of machine failures (i.e., to survive two machines failing, we will need a cluster comprising five instances).
Load
With a replicated graph database solution, we can optimize for load by scaling horizontally (for read load) and vertically (for write load).

Performance

Redundancy and load can be costed in terms of the number of machines necessary to ensure availability (five machines to provide continued availability in the face of two machines failing, for example) and scalability (one machine per some number of concurrent requests, as per the calculations in “Load”). But what about performance? How can we cost performance?

Calculating the cost of graph database performance

In order to understand the cost implications of optimizing for performance, we need to understand the performance characteristics of the database stack. As we describe in more detail later in “Native Graph Storage”, a graph database uses disk for durable storage, and main memory for caching portions of the graph.

Spinning disks are cheap, but not very fast for random seeks (around 6ms for a modern disk). Queries that have to reach all the way down to spinning disk will be orders of magnitude slower than queries that touch only an in-memory portion of the graph. Disk access can be improved by using solid-state drives (SSDs) in place of spinning disks, providing an approximate 20-fold increase in performance, or by using enterprise flash hardware, which can reduce latencies even further.


Note

For those deployments where the size of the data in the graph vastly eclipses the amount of RAM (and therefore cache) available, SSDs are an excellent choice, because they don’t have the mechanical penalties associated with spinning disks.


Performance optimization options

There are, then, three areas in which we can optimize for performance:

  • Increase the JVM heap size.
  • Increase the percentage of the store mapped into the page caches.
  • Invest in faster disks: SSDs or enterprise flash hardware.

As Figure 4-11 shows, the sweet spot for any cost versus performance trade-off lies around the point where we can map our store files in their entirety into the page cache, while allowing for a healthy, but modestly sized heap. Heaps of between 4 and 8 GB are not uncommon, though in many cases, a smaller heap can actually improve performance (by mitigating expensive GC behaviors).

Calculating how much RAM to allocate to the heap and the page cache depends on our knowing the projected size of our graph. Building a representative dataset early in our application’s development life cycle will furnish us with some of the data we need to make our calculations. If we cannot fit the entire graph into main memory, we should consider cache sharding (see “Cache sharding”).


Note

For more general performance and tuning tips, see this site.


Figure 4-11. Cost versus performance trade-offs

In optimizing a graph database solution for performance, we should bear in mind the following guidelines:

  • We should utilize the page cache as much as possible; if possible, we should map our store files in their entirety into this cache.
  • We should tune the JVM heap while monitoring garbage collection to ensure smooth behavior.
  • We should consider using fast disks — SSDs or enterprise flash hardware — to boost baseline performance when disk access becomes inevitable.

Redundancy

Planning for redundancy requires us to determine how many instances in a cluster we can afford to lose while keeping the application up and running. For non–business-critical applications, this figure might be as low as one (or even zero). Once a first instance has failed, another failure will render the application unavailable. Business-critical applications will likely require redundancy of at least two; that is, even after two machines have failed, the application continues serving requests.

For a graph database whose cluster management protocol requires a majority of members to be available to work properly, redundancy of one can be achieved with three or four instances, and redundancy of two with five instances. Four is no better than three in this respect, because if two instances from a four-instance cluster become unavailable, the remaining coordinators will no longer be able to achieve majority.

Load

Optimizing for load is perhaps the trickiest part of capacity planning. As a rule of thumb:

number of concurrent requests = (1000 / average request time (in milliseconds)) * number of cores per machine * number of machines

Actually determining what some of these figures are, or are projected to be, can sometimes be very difficult:

Average request time
This covers the period from when a server receives a request, to when it sends a response. Performance tests can help determine average request time, assuming the tests are running on representative hardware against a representative dataset (we’ll have to hedge accordingly if not). In many cases, the “representative dataset” itself is based on a rough estimate; we should modify our figures whenever this estimate changes.
Number of concurrent requests
We should distinguish here between average load and peak load. Determining the number of concurrent requests a new application must support is a difficult thing to do. If we’re replacing or upgrading an existing application, we may have access to some recent production statistics we can use to refine our estimates. Some organizations are able to extrapolate from existing application data the likely requirements for a new application. Other than that, it’s up to our stakeholders to estimate the projected load on the system, but we must beware of inflated expectations.
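The rule of thumb above is straightforward to apply once estimates are in hand. A small sketch, where the input figures (20 ms average request time, 4 cores per machine, 3 machines) are purely illustrative:

```python
# The capacity-planning rule of thumb from above:
# concurrent requests = (1000 / avg request time in ms) * cores per machine * machines

def concurrent_requests(avg_request_ms: float, cores_per_machine: int,
                        machines: int) -> float:
    return (1000 / avg_request_ms) * cores_per_machine * machines

# Illustrative figures only: 20 ms requests, 4 cores, 3 machines.
print(concurrent_requests(20, 4, 3))  # 600.0
```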

Importing and Bulk Loading Data

Many if not most deployments of any kind of database don’t start out with an empty store. As part of deploying the new database, we may also have data to migrate from a legacy platform, require master data from some third party system, or be merely importing test data — such as the data in the examples in this chapter — into an otherwise empty store. As time goes on, we may have to perform other bulk loading operations from upstream systems on a live store.

Neo4j provides tooling to achieve these goals, both for the initial bulk load and ongoing bulk import scenarios, allowing us to stream data from a variety of other sources into the graph.

Initial Import

For initial imports Neo4j has an initial load tool called neo4j-import, which achieves sustained ingest speeds of around 1,000,000 records per second.8 It achieves these impressive performance figures because it does not build the store files using the normal transactional capabilities of the database. Instead, it builds the store files in a raster-like fashion, adding individual layers until the store is complete, and it is only at completion that the store becomes consistent.

The input to the neo4j-import tool is a set of CSV files that provide node and relationship data. As an example, consider the following three CSV files, which represent a small movie data set.

The first file is movies.csv:

:ID,title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel

This first file represents the movies themselves. The first line of the file contains metadata describing the movies. In this case, we can see that each movie has an ID, a title, and a year (which is an integer). The ID field acts as a key. Other parts of the import can refer to a movie using its ID. Movies also have one or more labels: Movie and Sequel.
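To make these header conventions concrete, here is a small Python sketch (not part of the neo4j-import tool itself) that interprets the :ID, field:type, and :LABEL columns of the movies.csv content shown above:

```python
import csv
import io

# Interpret the neo4j-import CSV header conventions described above:
# ':ID' is the record's key, 'field:type' declares a typed property
# (only ':int' is handled in this sketch), and ':LABEL' carries one or
# more ';'-separated labels.

movies_csv = """\
:ID,title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel
"""

def parse_nodes(text):
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    nodes = []
    for row in rows:
        node = {"id": None, "labels": [], "props": {}}
        for field, value in zip(header, row):
            if field == ":ID":
                node["id"] = value
            elif field == ":LABEL":
                node["labels"] = value.split(";")
            elif field.endswith(":int"):
                node["props"][field[:-len(":int")]] = int(value)
            else:
                node["props"][field] = value
        nodes.append(node)
    return nodes

nodes = parse_nodes(movies_csv)
print(nodes[2]["labels"])  # ['Movie', 'Sequel']
```

Note how the sequels end up with two labels while the year is parsed as an integer, matching the description above.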

The second file, actors.csv, contains movie actors. As we can see, actors have an ID and name property, and an Actor label:

:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

The third file, roles.csv, specifies the roles that actors played in the movies. This file is used to create the relationships in the graph:

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",1,ACTS_IN
keanu,"Neo",2,ACTS_IN
keanu,"Neo",3,ACTS_IN
laurence,"Morpheus",1,ACTS_IN
laurence,"Morpheus",2,ACTS_IN
laurence,"Morpheus",3,ACTS_IN
carrieanne,"Trinity",1,ACTS_IN
carrieanne,"Trinity",2,ACTS_IN
carrieanne,"Trinity",3,ACTS_IN

Each line in this file contains a START_ID and an END_ID, a role value and a relationship TYPE. START_ID values comprise actor ID values from the actors CSV file. END_ID values comprise movie ID values from the movies CSV file. Each relationship is expressed as a START_ID and an END_ID, with a role property, and a name derived from the relationship TYPE.

With these files, we can run the import tool from the command line:

neo4j-import --into target_directory \
--nodes movies.csv --nodes actors.csv --relationships roles.csv

neo4j-import builds the database store files, and puts them in the target_directory.

Batch Import

Another common requirement is to push bulk data from external systems into a live graph. In Neo4j this is commonly performed using Cypher’s LOAD CSV command. LOAD CSV takes as input the same kind of CSV data we used with the neo4j-import tool. It is designed to support intermediate loads of around a million or so items, making it ideal for handling regular batch updates from upstream systems.

As an example, let’s enrich our existing movie graph with some data about set locations. locations.csv contains title and location fields, where location is a semi-colon-separated list of filming locations in the movie:

title,locations
"The Matrix",Sydney
"The Matrix Reloaded",Sydney;Oakland
"The Matrix Revolutions",Sydney;Oakland;Alameda

Given this data, we can load it into a live Neo4j database using the Cypher LOAD CSV command as follows:

LOAD CSV WITH HEADERS FROM 'file:///data/locations.csv' AS line
WITH split(line.locations,";") as locations, line.title as title
UNWIND locations AS location
MERGE (x:Location {name:location})
MERGE (m:Movie {title:title})
MERGE (m)-[:FILMED_IN]->(x)

The first line of this Cypher script tells the database that we want to load some CSV data from a file URI (LOAD CSV also works with HTTP URIs). WITH HEADERS tells the database that the first line of our CSV file contains named headers. AS line assigns the input file to the variable line. The rest of the script will then be executed for each line of CSV data in the source file.

The second line of the script, beginning with WITH, splits an individual line’s locations value into a collection of strings using Cypher’s split function. It then passes the resulting collection and the line’s title value on to the rest of the script.

UNWIND is where the interesting work begins. UNWIND expands a collection. Here, we use it to expand the locations collection into individual location rows (remember, we’re dealing at this point with a single movie’s locations), each of which will be processed by the MERGE statements that follow.
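As a rough Python analogy (this is not how Cypher executes internally), the split-then-UNWIND step fans a single CSV line out into one row per location:

```python
# Analogy for Cypher's split() + UNWIND on one CSV line: the semicolon-
# separated locations value becomes one (title, location) row per location.

line = {"title": "The Matrix Reloaded", "locations": "Sydney;Oakland"}

rows = [(line["title"], location)
        for location in line["locations"].split(";")]

print(rows)  # [('The Matrix Reloaded', 'Sydney'), ('The Matrix Reloaded', 'Oakland')]
```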

The first MERGE statement ensures that the location is represented by a node in the database. The second MERGE statement ensures that the movie is also present as a node. The third MERGE statement ensures that a FILMED_IN relationship exists between the location and movie nodes.


NOTE

MERGE is like a mixture of MATCH and CREATE. If the pattern described in the MERGE statement already exists in the graph, the statement’s identifiers will be bound to this existing data, much as if we’d specified MATCH. If the pattern does not currently exist in the graph, MERGE will create it, much as if we’d used CREATE.

For MERGE to match existing data, all the elements in the pattern must already exist in the graph. If it can’t match all parts of a pattern, MERGE will create a new instance of the entire pattern. This is why we have used three MERGE statements in our LOAD CSV script. Given a particular movie and a particular location, it’s quite possible that one or another of them is already present in the graph. It’s also possible for both of them to exist, but without a relationship connecting them. If we were to use a single, large MERGE statement instead of our three small statements:

MERGE (:Movie {title:title})-[:FILMED_IN]->
      (:Location {name:location})

the match would only succeed if the movie and location nodes and the relationship between them already exist. If any one part of this pattern does not exist, all parts will be created, leading to duplicate data.

Our strategy is to break apart the larger pattern into smaller chunks. We first ensure that the location is present. We next ensure that the movie is present. Finally, we ensure that the two nodes are connected. This incremental approach is quite normal when using MERGE.
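The match-or-create behavior of MERGE can be illustrated with a toy Python upsert. This is an analogy only; it ignores transactions, indexing, and the full pattern-matching semantics of Cypher:

```python
# Toy upsert illustrating MERGE semantics: if a node matching the pattern
# already exists it is bound (like MATCH); otherwise it is created (like CREATE).

graph_nodes = []

def merge_node(label, key, value):
    for n in graph_nodes:
        if n["label"] == label and n.get(key) == value:
            return n                       # MATCH: bind to existing data
    n = {"label": label, key: value}
    graph_nodes.append(n)                  # CREATE: pattern did not exist
    return n

a = merge_node("Location", "name", "Sydney")
b = merge_node("Location", "name", "Sydney")
print(a is b, len(graph_nodes))  # True 1
```

Running the same merge twice binds to the existing node rather than creating a duplicate, which is exactly why the script above is safe to re-run against a live graph.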


At this point we are able to insert bulk CSV data into a live graph. However, we have not yet considered the mechanical implications of our import. If we were to run large queries like this on an existing large dataset, it is likely that the insert would take a very long time. There are two key characteristics of import we need to consider in order to make it efficient:

  • Indexing of the existing graph
  • Transaction flow through the database

For those of us coming from a relational background, the need for indexing is (perhaps) obvious here. Without indexes, we have to search all movie nodes in the database (and in the worst case, all of the nodes) in order to determine whether a movie exists or not. This is a cost O(n) operation. With an index of movies, that cost drops to O(log n), which is a substantial improvement, especially for larger data sets. The same is true of locations.
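The gap between these costs is easy to quantify with a sketch (illustrative Python, not Neo4j internals): a worst-case linear scan of a million titles touches every entry, while a lookup modeled as binary search over a sorted index touches about twenty.

```python
import bisect
import math

# Compare the O(n) and O(log n) lookup costs described above, using a
# synthetic list of one million zero-padded movie titles.

titles = sorted(f"movie-{i:07d}" for i in range(1_000_000))
target = "movie-0999999"

# O(n): a worst-case linear scan touches every entry.
linear_steps = titles.index(target) + 1

# O(log n): binary search over the sorted "index" touches ~log2(n) entries.
log_steps = math.ceil(math.log2(len(titles)))
pos = bisect.bisect_left(titles, target)

print(linear_steps, log_steps)  # 1000000 20
```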

Declaring an index, as we saw in the previous chapter, is straightforward. To index movies, we simply issue the command CREATE INDEX ON :Movie(title). We can do this via the browser or using the shell. If the index is useful only during import (i.e., it plays no role in operational queries) then we drop it after the import with DROP INDEX ON :Movie(title).


NOTE

In some cases it is useful to add temporary IDs as properties to nodes so they can be easily referenced during import, especially when creating networks of relationships. These IDs have no domain significance. They exist simply for the duration of a multistep import process so the process can find specific nodes to be connected.

The use of temporary IDs is perfectly valid. Just remember to remove them using REMOVE once the import is complete.


Given that updates to live Neo4j instances are transactional, it follows that batch imports with LOAD CSV are also transactional. In the simplest case, LOAD CSV builds one transaction and feeds it to the database. For larger batch insertions this can be quite inefficient mechanically because the database has to manage a large amount of transaction state (sometimes gigabytes).

For large data imports, we can boost performance by breaking down a single large transactional commit into a series of smaller commits, which are then executed serially against the database. To achieve this, we use the PERIODIC COMMIT functionality. PERIODIC COMMIT breaks the import into a set of smaller transactions, which are committed after a certain number of rows (1000 by default) have been processed. With our movie location data, we could choose to reduce the default number of CSV lines per transaction to 100, for example, by prepending the Cypher script with USING PERIODIC COMMIT 100. The full script is:

USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM 'file:///data/locations.csv' AS line
WITH split(line.locations,";") as locations, line.title as title
UNWIND locations AS location
MERGE (x:Location {name:location})
MERGE (m:Movie {title:title})
MERGE (m)-[:FILMED_IN]->(x)
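The batching idea behind PERIODIC COMMIT can be sketched in a few lines of Python (an illustration of the chunking only, not of Neo4j's transaction machinery):

```python
# Group input rows into fixed-size chunks, each of which would be committed
# as its own transaction, rather than accumulating state for one huge commit.

def chunks(rows, size=100):
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = list(range(250))            # 250 stand-in CSV rows
batches = list(chunks(rows, 100))  # commit every 100 rows

print([len(b) for b in batches])  # [100, 100, 50]
```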

These facilities for loading bulk data allow us both to experiment with example datasets when designing a system, and integrate with other systems and sources of data as part of a production deployment. CSV is a ubiquitous data exchange format — almost every data and integration technology has some support for producing CSV output. This makes it extremely easy to import data into Neo4j, either as a one-time activity or on a periodic basis.

Summary

In this chapter we’ve discussed the most important aspects of developing a graph database application. We’ve seen how to create graph models that address an application’s needs and an end user’s goals, and how to make our models and associated queries expressive and robust using unit and performance tests. We’ve looked at the pros and cons of a couple of different application architectures, and enumerated the factors we need to consider when planning for production.

Finally we looked at options for rapidly loading bulk data into Neo4j, for both initial import and ongoing batch insertion into a live database.

In the next chapter we’ll look at how graph databases are being used today to solve real-world problems in domains as varied as social networking, recommendations, master data management, data center management, access control, and logistics.

1 For Agile user stories, see Mike Cohn, User Stories Applied (Addison-Wesley, 2004).

2 A doubly linked list is a nicety, because in practice relationships can be traversed in constant time in either direction.

3 See, for example, http://iansrobinson.com/2014/05/13/time-based-versioned-graphs/.

4 A list of Neo4j remote client libraries, as developed by the community, is maintained at http://neo4j.com/developer/language-guides/.

5 Tests not only act as documentation, but they can also be used to generate documentation. All of the Cypher documentation in the Neo4j manual is generated automatically from the unit tests used to develop Cypher.

6 A thorough discussion of agile performance testing can be found in Alistair Jones and Patrick Kua’s essay “Extreme Performance Testing,” in The ThoughtWorks Anthology, Volume 2 (Pragmatic Bookshelf, 2012).

7 Max De Marzi describes using Gatling to test Neo4j.

8 Using a new implementation of the tool available from version Neo4j 2.2 onwards.

Chapter 5. Graphs in the Real World

In this chapter we look at some of the common real-world use cases for graph databases and identify the reasons why organizations choose to use a graph database rather than a relational or other NOSQL store. The bulk of the chapter comprises three in-depth use cases, with details of the relevant data models and queries. Each of these examples has been drawn from a real-world production system; the names, however, have been changed, and the technical details simplified where necessary to highlight key design points and hide any accidental complexity.

Why Organizations Choose Graph Databases

Throughout this book, we’ve sung the praises of the graph data model, its power and flexibility, and its innate expressiveness. When it comes to applying a graph database to a real-world problem, with real-world technical and business constraints, organizations choose graph databases for the following reasons:

“Minutes to milliseconds” performance
Query performance and responsiveness are at the top of many organizations’ concerns with regard to their data platforms. Online transactional systems, large web applications in particular, must respond to end users in milliseconds if they are to be successful. In the relational world, as an application’s dataset size grows, join pains begin to manifest themselves, and performance deteriorates. Using index-free adjacency, a graph database turns complex joins into fast graph traversals, thereby maintaining millisecond performance irrespective of the overall size of the dataset.
Drastically accelerated development cycles
The graph data model reduces the impedance mismatch that has plagued software development for decades, thereby reducing the development overhead of translating back and forth between an object model and a tabular relational model. More importantly, the graph model reduces the impedance mismatch between the technical and business domains. Subject matter experts, architects, and developers can talk about and picture the core domain using a shared model that is then incorporated into the application itself.
Extreme business responsiveness
Successful applications rarely stay still. Changes in business conditions, user behaviors, and technical and operational infrastructures drive new requirements. In the past, this has required organizations to undertake careful and lengthy data migrations that involve modifying schemas, transforming data, and maintaining redundant data to serve old and new features. The schema-free nature of a graph database coupled with the ability to simultaneously relate data elements in lots of different ways allows a graph database solution to evolve as the business evolves, reducing risk and time-to-market.
Enterprise ready
When employed in a business-critical application, a data technology must be robust, scalable, and more often than not, transactional. Although some graph databases are fairly new and not yet fully mature, there are graph databases on the market that provide all the -ilities — ACID (Atomic, Consistent, Isolated, Durable) transactionality, high-availability, horizontal read scalability, and storage of billions of entities — needed by large enterprises today, as well as the previously discussed performance and flexibility characteristics. This has been an important factor leading to the adoption of graph databases by organizations, not merely in modest offline or departmental capacities, but in ways that can truly change the business.

Common Use Cases

In this section we describe some of the most common graph database use cases, identifying how the graph model and the specific characteristics of the graph database can be applied to generate competitive insight and significant business value.

Social

We are only just beginning to discover the power of social data. In their book Connected, social scientists Nicholas Christakis and James Fowler show how we can predict a person’s behavior by understanding who he is connected to.1

Social applications allow organizations to gain competitive and operational advantage by leveraging information about the connections between people. By combining discrete information about individuals and their relationships, organizations are able to facilitate collaboration, manage information, and predict behavior.

As Facebook’s use of the term social graph implies, graph data models and graph databases are a natural fit for this overtly relationship-centered domain. Social networks help us identify the direct and indirect relationships between people, groups, and the things with which they interact, allowing users to rate, review, and discover each other and the things they care about. By understanding who interacts with whom, how people are connected, and what representatives within a group are likely to do or choose based on the aggregate behavior of the group, we generate tremendous insight into the unseen forces that influence individual behaviors. We discuss predictive modeling and its role in social network analysis in more detail in “Graph Theory and Predictive Modeling”.

Social relations may be either explicit or implicit. Explicit relations occur wherever social subjects volunteer a direct link — by liking someone on Facebook, for example, or indicating someone is a current or former colleague, as happens on LinkedIn. Implicit relations emerge out of other relationships that indirectly connect two or more subjects by way of an intermediary. We can relate subjects based on their opinions, likes, purchases, and even the products of their day-to-day work. Such indirect relationships lend themselves to being applied in multiple suggestive and inferential ways. We can say that A is likely to know, like, or otherwise connect to B based on some common intermediaries. In so doing, we move from social network analysis into the realm of recommendation engines.

Recommendations

Effective recommendations are a prime example of generating end-user value through the application of an inferential or suggestive capability. Whereas line-of-business applications typically apply deductive and precise algorithms — calculating payroll, applying tax, and so on — to generate end-user value, recommendation algorithms are inductive and suggestive, identifying people, products, or services an individual or group is likely to have some interest in.

Recommendation algorithms establish relationships between people and things: other people, products, services, media content — whatever is relevant to the domain in which the recommendation is employed. Relationships are established based on users’ behaviors as they purchase, produce, consume, rate, or review the resources in question. The recommendation engine can then identify resources of interest to a particular individual or group, or individuals and groups likely to have some interest in a particular resource. With the first approach, identifying resources of interest to a specific user, the behavior of the user in question — her purchasing behavior, expressed preferences, and attitudes as expressed in ratings and reviews — are correlated with those of other users in order to identify similar users and thereafter the things with which they are connected. The second approach, identifying users and groups for a particular resource, focuses on the characteristics of the resource in question. The engine then identifies similar resources, and the users associated with those resources.

As in the social use case, making an effective recommendation depends on understanding the connections between things, as well as the quality and strength of those connections — all of which are best expressed as a property graph. Queries are primarily graph local, in that they start with one or more identifiable subjects, whether people or resources, and thereafter discover surrounding portions of the graph.

Taken together, social networks and recommendation engines provide key differentiating capabilities in the areas of retail, recruitment, sentiment analysis, search, and knowledge management. Graphs are a good fit for the densely connected data structures germane to each of these areas. Storing and querying this data using a graph database allows an application to surface end-user real-time results that reflect recent changes to the data, rather than precalculated, stale results.

Geo

Geospatial is the original graph use case. Euler solved the Seven Bridges of Königsberg problem by positing a mathematical theorem that later came to form the basis of graph theory. Geospatial applications of graph databases range from calculating routes between locations in an abstract network such as a road or rail network, airspace network, or logistical network (as illustrated by the logistics example later in this chapter) to spatial operations such as find all points of interest in a bounded area, find the center of a region, and calculate the intersection between two or more regions.

Geospatial operations depend upon specific data structures, ranging from simple weighted and directed relationships, through to spatial indexes, such as R-Trees, which represent multidimensional properties using tree data structures. As indexes, these data structures naturally take the form of a graph, typically hierarchical in form, and as such they are a good fit for a graph database. Because of the schema-free nature of graph databases, geospatial data can reside in the database alongside other kinds of data — social network data, for example — allowing for complex multidimensional querying across several domains.2

Geospatial applications of graph databases are particularly relevant in the areas of telecommunications, logistics, travel, timetabling, and route planning.

Master Data Management

Master data is data that is critical to the operation of a business, but which itself is nontransactional. Master data includes data concerning users, customers, products, suppliers, departments, geographies, sites, cost centers, and business units. In large organizations, this data is often held in many different places, with lots of overlap and redundancy, in several different formats, and with varying degrees of quality and means of access. Master Data Management (MDM) is the practice of identifying, cleaning, storing, and, most importantly, governing this data. Its key concerns include managing change over time as organizational structures change, businesses merge, and business rules change; incorporating new sources of data; supplementing existing data with externally sourced data; addressing the needs of reporting, compliance, and business intelligence consumers; and versioning data as its values and schemas change.

Graph databases don’t necessarily provide a full MDM solution. They are, however, ideally applied to the modeling, storing, and querying of hierarchies, master data metadata, and master data models. Such models include type definitions, constraints, relationships between entities, and the mappings between the model and the underlying source systems. A graph database’s structured yet schema-free data model provides for ad hoc, variable, and exceptional structures — schema anomalies that commonly arise when there are multiple redundant data sources — while at the same time allowing for the rapid evolution of the master data model in line with changing business needs.
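As an illustration, a small master data hierarchy could be modeled and queried along the following lines. This is a sketch only; the labels and relationship type used here (`Company`, `BusinessUnit`, `CHILD_OF`) are hypothetical, not drawn from any particular MDM model:

```cypher
// A hypothetical master data hierarchy
CREATE (acme:Company {name:'Acme, Inc'}),
       (sales:BusinessUnit {name:'Sales'}),
       (emea:BusinessUnit {name:'EMEA Sales'}),
       (sales)-[:CHILD_OF]->(acme),
       (emea)-[:CHILD_OF]->(sales)

// Find every unit beneath a company, at any depth
MATCH (unit:BusinessUnit)-[:CHILD_OF*]->(c:Company {name:'Acme, Inc'})
RETURN unit.name
```

Because the model is schema-free, a newly acquired business with a different reporting structure can be attached to the same graph without migrating the existing data.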

Network and Data Center Management

In Chapter 3 we looked at a simple data center domain model, showing how the physical and virtual assets inside a data center can be easily modeled with a graph. Communications networks are graph structures. Graph databases are, therefore, a great fit for modeling, storing, and querying this kind of domain data. The distinction between network management of a large communications network versus data center management is largely a matter of which side of the firewall you’re working. For all intents and purposes, these two things are one and the same.

A graph representation of a network enables us to catalog assets, visualize how they are deployed, and identify the dependencies between them. The graph’s connected structure, together with a query language like Cypher, enable us to conduct sophisticated impact analyses, answering questions such as:

  • Which parts of the network — which applications, services, virtual machines, physical machines, data centers, routers, switches, and fiber — do important customers depend on? (Top-down analysis)
  • Conversely, which applications and services, and ultimately, customers, in the network will be affected if a particular network element — a router or switch, for example — fails? (Bottom-up analysis)
  • Is there redundancy throughout the network for the most important customers?

Graph database solutions complement existing network management and analysis tools. As with master data management, they can be used to bring together data from disparate inventory systems, providing a single view of the network and its consumers, from the smallest network element all the way to application and services and the customers who use them. A graph database representation of the network can also be used to enrich operational intelligence based on event correlations. Whenever an event correlation engine (a Complex Event Processor, for example) infers a complex event from a stream of low-level network events, it can assess the impact of that event using the graph model, and thereafter trigger any necessary compensating or mitigating actions.
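A bottom-up impact query of this kind might be sketched in Cypher as follows. The labels and relationship types here (`Router`, `Service`, `Customer`, `USES`, `DEPENDS_ON`) are illustrative assumptions, not a schema prescribed by any particular network management tool:

```cypher
// Which customers are affected if router 'rtr-17' fails? (hypothetical model)
MATCH (r:Router {id:'rtr-17'})
MATCH (c:Customer)-[:USES]->(:Service)-[:DEPENDS_ON*1..5]->(r)
RETURN DISTINCT c.name
```

The variable-length `DEPENDS_ON*1..5` traversal walks the dependency chain from services down to the failed element; bounding its depth keeps the query predictable on large networks.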

Today, graph databases are being successfully employed in the areas of telecommunications, network management and analysis, cloud platform management, data center and IT asset management, and network impact analysis, where they are reducing impact analysis and problem resolution times from days and hours down to minutes and seconds. Performance, flexibility in the face of changing network schemas, and fit for the domain are all important factors here.

Authorization and Access Control (Communications)

Authorization and access control solutions store information about parties (e.g., administrators, organizational units, end-users) and resources (e.g., files, shares, network devices, products, services, agreements), together with the rules governing access to those resources. They then apply these rules to determine who can access or manipulate a resource. Access control has traditionally been implemented either using directory services or by building a custom solution inside an application’s backend. Hierarchical directory structures, however, cannot cope with the nonhierarchical organizational and resource dependency structures that characterize multiparty distributed supply chains. Hand-rolled solutions, particularly those developed on a relational database, suffer join pain as the dataset size grows, becoming slow and unresponsive, and ultimately delivering a poor end-user experience.

A graph database can store complex, densely connected access control structures spanning billions of parties and resources. Its structured yet schema-free data model supports both hierarchical and nonhierarchical structures, while its extensible property model allows for capturing rich metadata regarding every element in the system. With a query engine that can traverse millions of relationships per second, access lookups over large, complex structures execute in milliseconds.

As with network management and analysis, a graph database access control solution allows for both top-down and bottom-up queries:

  • Which resources — company structures, products, services, agreements, and end users — can a particular administrator manage? (Top-down)
  • Which resource can an end user access?
  • Given a particular resource, who can modify its access settings? (Bottom-up)
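The second of these lookups could be sketched as follows. The `User` and `Resource` labels and the `MEMBER_OF`/`ALLOWED_ON` relationship types are hypothetical, standing in for whatever the access control model actually defines:

```cypher
// Resources a user can reach directly, or through up to three
// levels of group membership (illustrative model)
MATCH (u:User {name:'Alice'})-[:MEMBER_OF*0..3]->()-[:ALLOWED_ON]->(r:Resource)
RETURN DISTINCT r.name
```

Because hierarchical and nonhierarchical structures are both just relationships, the same traversal covers nested groups and direct grants alike.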

Graph database access control and authorization solutions are particularly applicable in the areas of content management, federated authorization services, social networking preferences, and software as a service (SaaS) offerings, where they realize minutes to milliseconds increases in performance over their hand-rolled, relational predecessors.

Real-World Examples

In this section we describe three example use cases in detail: social and recommendations, authorization and access control, and logistics. Each use case is drawn from one or more production applications of a graph database (specifically in these cases, Neo4j). Company names, context, data models, and queries have been tweaked to eliminate accidental complexity and to highlight important design and implementation choices.

Social Recommendations (Professional Social Network)

Talent.net is a social recommendations application that enables users to discover their own professional network, and identify other users with particular skill sets. Users work for companies, work on projects, and have one or more interests or skills. Based on this information, Talent.net can describe a user’s professional network by identifying other subscribers who share his or her interests. Searches can be restricted to the user’s current company, or extended to encompass the entire subscriber base. Talent.net can also identify individuals with specific skills who are directly or indirectly connected to the current user. Such searches are useful when looking for a subject matter expert for a current engagement.

Talent.net shows how a powerful inferential capability can be developed using a graph database. Although many line-of-business applications are deductive and precise — calculating tax or salary, or balancing debits and credits, for example — a new seam of end-user value opens up when we apply inductive algorithms to our data. This is what Talent.net does. Based on people’s interests and skills, and their work history, the application can suggest likely candidates to include in one’s professional network. These results are not precise in the way a payroll calculation must be precise, but they are undoubtedly useful nonetheless.

Talent.net infers connections between people. Contrast this with LinkedIn, where users explicitly declare they know or have worked with someone. This is not to say that LinkedIn is a precise social networking capability, because it too applies inductive algorithms to generate further insight. But with Talent.net even the primary tie, (A)-[:KNOWS]->(B), is inferred, rather than volunteered.

The first version of Talent.net depends on users having supplied information regarding their interests, skills, and work history so that it can infer their professional social relations. But with the core inferential capabilities in place, the platform is set to generate even greater insight for less end-user effort. Skills and interests, for example, can be inferred from the processes and products of a person’s day-to-day work activities. Whether writing code, writing a document, or exchanging emails, a user must interact with a backend system. By intercepting these interactions, Talent.net can capture data that indicates what skills a person has, and what activities they engage in. Other sources of data that help contextualize a user include group memberships and meetup lists. Although the use case presented here does not cover these higher-order inferential features, their implementation requires mostly application integration and partnership agreements rather than any significant change to the graph or the algorithms used.

Talent.net data model

To help describe the Talent.net data model, we’ve created a small sample graph, as shown in Figure 5-1, which we’ll use throughout this section to illustrate the Cypher queries behind the main Talent.net use cases.

The sample graph shown here has just two companies, each with several employees. An employee is connected to his employer by a WORKS_FOR relationship. Each employee is INTERESTED_IN one or more topics, and has WORKED_ON one or more projects. Occasionally, employees from different companies work on the same project.
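A fragment of this sample data could be created with Cypher along the following lines. This is a sketch covering Sarah and Ben only; the full sample graph in Figure 5-1 contains more users, topics, and projects:

```cypher
// Illustrative fragment of the Talent.net sample graph
CREATE (acme:Company {name:'Acme, Inc'}),
       (sarah:User {name:'Sarah'}),
       (ben:User {name:'Ben'}),
       (graphs:Topic {name:'Graphs'}),
       (rest:Topic {name:'REST'}),
       (ngp:Project {name:'Next Gen Platform'}),
       (sarah)-[:WORKS_FOR]->(acme),
       (ben)-[:WORKS_FOR]->(acme),
       (sarah)-[:INTERESTED_IN]->(graphs),
       (sarah)-[:INTERESTED_IN]->(rest),
       (ben)-[:INTERESTED_IN]->(graphs),
       (ben)-[:INTERESTED_IN]->(rest),
       (sarah)-[:WORKED_ON]->(ngp),
       (ben)-[:WORKED_ON]->(ngp)
```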

This structure addresses two important use cases:

  • Given a user, infer social relations — that is, identify their professional social network — based on shared interests and skills.
  • Given a user, recommend someone that they have worked with, or who has worked with people they have worked with, who has a particular skill.

The first use case helps build communities around shared interests. The second helps identify people to fill specific project roles.

Figure 5-1. Sample of the Talent.net social network

Inferring social relations

Talent.net’s graph allows it to infer a user’s professional social network by finding people who share that user’s interests. The strength of the recommendation depends on the number of shared interests. If Sarah is interested in Java, graphs, and REST, Ben in graphs and REST, and Charlie in graphs, cars, and medicine, then there is a likely tie between Sarah and Ben based on their mutual interest in graphs and REST, and another tie between Sarah and Charlie based on their mutual interest in graphs, with the tie between Sarah and Ben stronger than the one between Sarah and Charlie (two shared interests versus one).

Figure 5-2 shows the pattern representing colleagues who share a user’s interests. The subject node refers to the subject of the query (in the preceding example, this is Sarah). This node can be looked up in an index. The remaining nodes will be discovered once the pattern is anchored to the subject node and then flexed around the graph.

The Cypher to implement this query is shown here:

MATCH  (subject:User {name:{name}})
MATCH  (subject)-[:WORKS_FOR]->(company:Company)<-[:WORKS_FOR]-(person:User),
       (subject)-[:INTERESTED_IN]->(interest)<-[:INTERESTED_IN]-(person:User)
RETURN person.name AS name,
       count(interest) AS score,
       collect(interest.name) AS interests
ORDER BY score DESC
Figure 5-2. Pattern to find colleagues who share a user’s interests

The query works as follows:

  • The first MATCH finds the subject (here, Sarah) in the nodes labeled User and assigns the result to the subject identifier.
  • The second MATCH then matches this User with people who work for the same company, and who share one or more of their interests. If the subject of the query is Sarah, who works for Acme, then in the case of Ben, MATCH will match twice: Ben works for Acme, and is interested in graphs (first match), and REST (second match). In the case of Charlie, it will match once: Charlie works for Acme, and is interested in graphs.
  • RETURN creates a projection of the matched data. For each matched colleague, we extract their name, count the number of interests they have in common with the subject of the query (aliasing this result as score), and, using collect, create a comma-separated list of these mutual interests. Where a person has multiple matches, as does Ben in our example, count and collect aggregate their matches into a single row in the returned results. (In fact, both count and collect can perform this aggregating function independently of one another.)
  • Finally, we order the results based on each colleague’s score, highest first.

Running this query against our sample graph, with Sarah as the subject, yields the following results:

+---------------------------------------+
| name      | score | interests         |
+---------------------------------------+
| "Ben"     | 2     | ["Graphs","REST"] |
| "Charlie" | 1     | ["Graphs"]        |
+---------------------------------------+
2 rows

Figure 5-3 shows the portion of the graph that was matched to generate these results.

Figure 5-3. Colleagues who share Sarah’s interests

Notice that this query only finds people who work for the same company as Sarah. If we want to extend the search to find people who work for other companies, we need to modify the query slightly:

MATCH  (subject:User {name:{name}})
MATCH  (subject)-[:INTERESTED_IN]->(interest:Topic)
         <-[:INTERESTED_IN]-(person:User),
       (person)-[:WORKS_FOR]->(company:Company)
RETURN person.name AS name,
       company.name AS company,
       count(interest) AS score,
       collect(interest.name) AS interests
ORDER BY score DESC

The changes are as follows:

  • In the MATCH clause, we no longer require matched persons to work for the same company as the subject of the query. (We do, however, still capture the company with which a matched person is associated, because we want to return this information in the results.)
  • In the RETURN clause we now include the company details for each matched person.

Running this query against our sample data returns the following results:

+---------------------------------------------------------------+
| name      | company        | score | interests                |
+---------------------------------------------------------------+
| "Arnold"  | "Startup, Ltd" | 3     | ["Java","Graphs","REST"] |
| "Ben"     | "Acme, Inc"    | 2     | ["Graphs","REST"]        |
| "Gordon"  | "Startup, Ltd" | 1     | ["Graphs"]               |
| "Charlie" | "Acme, Inc"    | 1     | ["Graphs"]               |
+---------------------------------------------------------------+
4 rows

Figure 5-4 shows the portion of the graph that was matched to generate these results.

Figure 5-4. People who share Sarah’s interests

Although Ben and Charlie still feature in the results, it turns out that Arnold, who works for Startup, Ltd., has most in common with Sarah: three topics compared to Ben’s two and Charlie’s one.

Finding colleagues with particular interests

In the second Talent.net use case, we turn from inferring social relations based on shared interests to finding individuals who have a particular skillset, and who have either worked with the person who is the subject of the query, or worked with people who have worked with the subject. By applying the graph in this manner we can find individuals to staff project roles based on their social ties to people we trust — or at least with whom we have worked.

The social ties in question arise from individuals having worked on the same project. Contrast this with the previous use case, where the social ties were inferred based on shared interests. If people have worked on the same project, we infer a social tie. The projects, then, form intermediate nodes that bind two or more people together. In other words, a project is an instance of collaboration that has brought several people into contact with one another. Anyone we discover in this fashion is a candidate for including in our results — as long as they possess the interests or skills we are looking for.

Here’s a Cypher query that finds colleagues, and colleagues-of-colleagues, who have one or more particular interests:

MATCH (subject:User {name:{name}})
MATCH p=(subject)-[:WORKED_ON]->(:Project)-[:WORKED_ON*0..2]-(:Project)
        <-[:WORKED_ON]-(person:User)-[:INTERESTED_IN]->(interest:Topic)
WHERE person<>subject AND interest.name IN {interests}
WITH person, interest, min(length(p)) as pathLength
ORDER BY interest.name
RETURN person.name AS name,
       count(interest) AS score,
       collect(interest.name) AS interests,
       ((pathLength - 1)/2) AS distance
ORDER BY score DESC
LIMIT {resultLimit}

This is quite a complex query. Let’s break it down a little and look at each part in more detail:

  • The first MATCH finds the subject of the query in the nodes labeled User and assigns the result to the subject identifier.
  • The second MATCH finds people who are connected to the subject by way of having worked on the same project, or having worked on the same project as people who have worked with the subject. For each person we match, we capture his interests. This match is then further refined by the WHERE clause, which excludes nodes that match the subject of the query, and ensures that we only match people who are interested in the things we care about. For each successful match, we assign the entire path of the match — that is, the path that extends from the subject of the query all the way through the matched person to his interest — to the identifier p. We’ll look at this MATCH clause in more detail shortly.
  • WITH pipes the results to the RETURN clause, filtering out redundant paths as it does so. Redundant paths are present in the results at this point because colleagues and colleagues-of-colleagues are often reachable through different paths, some longer than others. We want to filter these longer paths out. That’s exactly what the WITH clause does. The WITH clause emits triples comprising a person, an interest, and the length of the path from the subject of the query through the person to his interest. Given that any particular person/interest combination may appear more than once in the results, but with different path lengths, we want to aggregate these multiple lines by collapsing them to a triple containing only the shortest path, which we do using min(length(p)) as pathLength.
  • RETURN creates a projection of the data, performing more aggregation as it does so. The data piped by the WITH clause to RETURN contains one entry per person per interest. If a person matches two of the supplied interests, there will be two separate data entries. We aggregate these entries using count and collect: count to create an overall score for a person, collect to create a comma-separated list of matched interests for that person. As part of the results, we also calculate how far the matched person is from the subject of the query. We do this by taking the pathLength for that person, subtracting one (for the INTERESTED_IN relationship at the end of the path), and then dividing by two (because the person is separated from the subject by pairs of WORKED_ON relationships). Finally, we order the results based on score, highest score first, and limit them according to a resultLimit parameter supplied by the query’s client.

The second MATCH clause in the preceding query uses a variable-length path, [:WORKED_ON*0..2], as part of a larger pattern to match people who have worked directly with the subject of the query, as well as people who have worked on the same project as people who have worked with the subject. Because each person is separated from the subject of the query by one or two pairs of WORKED_ON relationships, Talent.net could have written this portion of the query as MATCH p=(subject)-[:WORKED_ON*2..4]-(person)-[:INTERESTED_IN]->(interest), with a variable-length path of between two and four WORKED_ON relationships. However, long variable-length paths can be relatively inefficient. When writing such queries, it is advisable to restrict variable-length paths to as narrow a scope as possible. To increase the performance of the query, Talent.net uses a fixed-length outgoing WORKED_ON relationship that extends from the subject to her first project, and another fixed-length WORKED_ON relationship that connects the matched person to a project, with a smaller variable-length path in between.

Running this query against our sample graph, and again taking Sarah as the subject of the query, if we look for colleagues and colleagues-of-colleagues who have interests in Java, travel, or medicine, we get the following results:

+--------------------------------------------------+
| name      | score | interests         | distance |
+--------------------------------------------------+
| "Arnold"  | 2     | ["Java","Travel"] | 2        |
| "Charlie" | 1     | ["Medicine"]      | 1        |
+--------------------------------------------------+
2 rows

Note that the results are ordered by score, not distance. Arnold has two out of the three interests, and therefore scores higher than Charlie, who only has one, even though he is at two removes from Sarah, whereas Charlie has worked directly with Sarah.

Figure 5-5 shows the portion of the graph that was traversed and matched to generate these results.

Figure 5-5. Finding people with particular interests

Let’s take a moment to understand how this query executes in more detail. Figure 5-6 shows three stages in the execution of the query. (For visual clarity we’ve removed labels and emphasized the important property values.) The first stage shows each of the paths as they are matched by the MATCH and WHERE clauses. As we can see, there is one redundant path: Charlie is matched directly, through Next Gen Platform, but also indirectly, by way of Quantum Leap and Emily. The second stage represents the filtering that takes place in the WITH clause. Here we emit triples comprising the matched person, the matched interest, and the length of the shortest path from the subject through the matched person to her interest. The third stage represents the RETURN clause, wherein we aggregate the results on behalf of each matched person, and calculate her score and distance from the subject.

Figure 5-6. The query pipeline

Adding WORKED_WITH relationships

The query for finding colleagues and colleagues-of-colleagues with particular interests is the one most frequently executed on Talent.net’s site, and the success of the site depends in large part on its performance. The query uses pairs of WORKED_ON relationships (for example, ('Sarah')-[:WORKED_ON]->('Next Gen Platform')<-[:WORKED_ON]-('Charlie')) to infer that users have worked with one another. Although reasonably performant, this is nonetheless inefficient, because it requires traversing two explicit relationships to infer the presence of a single implicit relationship.

To eliminate this inefficiency, Talent.net decided to precompute a new kind of relationship, WORKED_WITH, thereby enriching the graph with shortcuts for these performance-critical access patterns. As we discussed in “Iterative and Incremental Development”, it’s quite common to optimize graph access by adding a direct relationship between two nodes that would otherwise be connected only by way of intermediaries.

In terms of the Talent.net domain, WORKED_WITH is a bidirectional relationship. In the graph, however, it is implemented using a unidirectional relationship. Although a relationship’s direction can often add useful semantics to its definition, in this instance the direction is meaningless. This isn’t a significant issue, so long as queries that operate with WORKED_WITH relationships ignore the relationship direction.


Note

Graph databases support traversal of relationships in either direction at the same low cost, so the decision as to whether to include a reciprocal relationship should be driven from the domain. For example, PREVIOUS and NEXT may not both be necessary in a linked list, but in a social network that represents sentiment, it is important to be explicit about who loves whom, and not assume reciprocity.


Calculating a user’s WORKED_WITH relationships and adding them to the graph isn’t difficult, nor is it particularly expensive in terms of resource consumption. It can, however, add milliseconds to any end-user interactions that update a user’s profile with new project information, so Talent.net has decided to perform this operation asynchronously to end-user activities. Whenever a user changes his project history, Talent.net adds a job to a queue. This job recalculates the user’s WORKED_WITH relationships. A single writer thread polls this queue and executes the jobs using the following Cypher statement:

MATCH (subject:User {name:{name}})
MATCH (subject)-[:WORKED_ON]->()<-[:WORKED_ON]-(person:User)
WHERE NOT((subject)-[:WORKED_WITH]-(person))
WITH DISTINCT subject, person
CREATE UNIQUE (subject)-[:WORKED_WITH]-(person)
RETURN subject.name AS startName, person.name AS endName
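
The queue-and-single-writer arrangement can be sketched as follows. This is a minimal illustration under stated assumptions, not Talent.net's actual implementation; `recalculate_worked_with` is a hypothetical stand-in for executing the Cypher statement above against the database:

```python
import queue
import threading

def recalculate_worked_with(username):
    # Placeholder: in the real system this would execute the Cypher
    # statement above against the graph database for `username`.
    return f"recalculated WORKED_WITH for {username}"

def writer(jobs, results, stop):
    # The single writer thread drains the job queue one user at a
    # time, serializing all WORKED_WITH updates.
    while not (stop.is_set() and jobs.empty()):
        try:
            name = jobs.get(timeout=0.1)
        except queue.Empty:
            continue
        results.append(recalculate_worked_with(name))
        jobs.task_done()

jobs, results, stop = queue.Queue(), [], threading.Event()
t = threading.Thread(target=writer, args=(jobs, results, stop))
t.start()
for user in ("Sarah", "Charlie"):   # profile updates enqueue jobs
    jobs.put(user)
jobs.join()                         # wait for all jobs to complete
stop.set()
t.join()
print(results)
```

Because a single thread performs all the writes, recalculations for the same user never race with one another, and end-user profile updates pay only the cost of an enqueue.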

Figure 5-7 shows what our sample graph looks like once it has been enriched with WORKED_WITH relationships.

Figure 5-7. The Talent.net graph enriched with WORKED_WITH relationships

Using the enriched graph, Talent.net now finds colleagues and colleagues-of-colleagues with particular interests using a slightly simpler version of the query we looked at earlier:

MATCH (subject:User {name:{name}})
MATCH p=(subject)-[:WORKED_WITH*0..1]-(:User)-[:WORKED_WITH]-(person:User)
        -[:INTERESTED_IN]->(interest:Topic)
WHERE person<>subject AND interest.name IN {interests}
WITH person, interest, min(length(p)) as pathLength
RETURN person.name AS name,
       count(interest) AS score,
       collect(interest.name) AS interests,
       (pathLength - 1) AS distance
ORDER BY score DESC
LIMIT {resultLimit}

Authorization and Access Control

TeleGraph Communications is an international communications services company. Millions of domestic and business users subscribe to its products and services. For several years, it has offered its largest business customers the ability to self-service their accounts. Using a browser-based application, administrators within each of these customer organizations can add and remove services on behalf of their employees. To ensure that users and administrators see and change only those parts of the organization and the products and services they are entitled to manage, the application employs a complex access control system, which assigns privileges to many millions of users across tens of millions of product and service instances.

TeleGraph has decided to replace the existing access control system with a graph database solution. There are two drivers here: performance and business responsiveness.

Performance issues have dogged TeleGraph’s self-service application for several years. The original system is based on a relational database, which uses recursive joins to model complex organizational structures and product hierarchies, and stored procedures to implement the access control business logic. Because of the join-intensive nature of the data model, many of the most important queries are unacceptably slow. For large companies, generating a view of the things an administrator can manage takes many minutes. This creates a very poor user experience, and hampers the revenue-generating opportunities presented by the self-service offering.

TeleGraph has ambitious plans to move into new regions and markets, effectively increasing its customer base by an order of magnitude. But the performance issues that affect the original application suggest it is no longer fit for today’s needs, never mind tomorrow’s. A graph database solution, in contrast, offers the performance, scalability, and adaptiveness necessary for dealing with a rapidly changing market.

TeleGraph data model

Figure 5-8 shows a sample of the TeleGraph data model. (For clarity, labels are presented at the top of each set of nodes once only, rather than being attached to every node. In the real data, all nodes have at least one label.)

Figure 5-8. An access control graph

This model comprises two hierarchies. In the first hierarchy, administrators within each customer organization are assigned to groups. These groups are then accorded various permissions against that organization’s organizational structure:

  • ALLOWED_INHERIT connects an administrator group to an organizational unit, thereby allowing administrators within that group to manage the organizational unit. This permission is inherited by children of the parent organizational unit. We see an example of inherited permissions in the TeleGraph example data model in the relationships between Group 1 and Acme, and the child of Acme, Spinoff. Group 1 is connected to Acme using an ALLOWED_INHERIT relationship. Ben, as a member of Group 1, can manage employees both of Acme and Spinoff thanks to this ALLOWED_INHERIT relationship.
  • ALLOWED_DO_NOT_INHERIT connects an administrator group to an organizational unit in a way that allows administrators within that group to manage the organizational unit, but not any of its children. Sarah, as a member of Group 2, can administer Acme, but not its child Spinoff, because Group 2 is connected to Acme by an ALLOWED_DO_NOT_INHERIT relationship, not an ALLOWED_INHERIT relationship.
  • DENIED forbids administrators from accessing an organizational unit. This permission is inherited by children of the parent organizational unit. In the TeleGraph diagram, this is best illustrated by Liz and her permissions with respect to Big Co, Acquired Ltd, Subsidiary, and One-Map Shop. As a result of her membership of Group 4 and its ALLOWED_INHERIT permission on Big Co, Liz can manage Big Co. But despite this being an inheritable relationship, Liz cannot manage Acquired Ltd or Subsidiary because Group 5, of which Liz is a member, is DENIED access to Acquired Ltd and its children (which includes Subsidiary). Liz can, however, manage One-Map Shop, thanks to an ALLOWED_DO_NOT_INHERIT permission granted to Group 6, the last group to which Liz belongs.

DENIED takes precedence over ALLOWED_INHERIT, but is subordinate to ALLOWED_DO_NOT_INHERIT. Therefore, if an administrator is connected to a company by way of ALLOWED_DO_NOT_INHERIT and DENIED, ALLOWED_DO_NOT_INHERIT prevails.
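
These precedence rules can be captured in a small helper. The sketch below is hypothetical (not part of the TeleGraph code): it takes the set of permission relationship types that apply between an administrator's groups and an organizational unit, and resolves effective access:

```python
def can_manage(permissions):
    """Resolve effective access from the set of permission relationship
    types that apply to an administrator for an organizational unit.

    Precedence, per the rules above: ALLOWED_DO_NOT_INHERIT beats
    DENIED, which in turn beats ALLOWED_INHERIT; no applicable
    permission means no access."""
    if "ALLOWED_DO_NOT_INHERIT" in permissions:
        return True
    if "DENIED" in permissions:
        return False
    return "ALLOWED_INHERIT" in permissions

# Liz and Acquired Ltd: an inherited ALLOWED_INHERIT from Big Co is
# overridden by the DENIED relationship.
print(can_manage({"ALLOWED_INHERIT", "DENIED"}))   # False
```

Encoding the precedence as an ordered series of checks, rather than as a score, keeps the policy easy to audit against the rules stated in prose.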

Finding all accessible resources for an administrator

The TeleGraph application uses many different Cypher queries. We’ll look at just a few of them here.

First up is the ability to find all the resources an administrator can access. Whenever an onsite administrator logs in to the system, he is presented with a list of all the employees and employee accounts he can administer. This list is generated based on the results returned from the following query:

MATCH (admin:Admin {name:{adminName}})
MATCH paths=(admin)-[:MEMBER_OF]->(:Group)-[:ALLOWED_INHERIT]->(:Company)
        <-[:CHILD_OF*0..3]-(company:Company)<-[:WORKS_FOR]-(employee:Employee)
        -[:HAS_ACCOUNT]->(account:Account)
WHERE NOT ((admin)-[:MEMBER_OF]->(:Group)
            -[:DENIED]->(:Company)<-[:CHILD_OF*0..3]-(company))
RETURN employee.name AS employee, account.name AS account
UNION
MATCH (admin:Admin {name:{adminName}})
MATCH paths=(admin)-[:MEMBER_OF]->(:Group)-[:ALLOWED_DO_NOT_INHERIT]->(:Company)
        <-[:WORKS_FOR]-(employee:Employee)-[:HAS_ACCOUNT]->(account:Account)
RETURN employee.name AS employee, account.name AS account

Like all the other queries we’ll be looking at in this section, this query comprises two separate queries joined by a UNION operator. The query before the UNION operator handles ALLOWED_INHERIT relationships qualified by any DENIED relationships. The query following the UNION operator handles any ALLOWED_DO_NOT_INHERIT permissions. This pattern, ALLOWED_INHERIT minus DENIED, followed by ALLOWED_DO_NOT_INHERIT, is repeated in all of the access control example queries that we’ll be looking at.

The first query here, the one before the UNION operator, can be broken down as follows:

  • The first MATCH selects the logged-in administrator from the nodes labeled Admin, and binds the result to the admin identifier.
  • MATCH matches all the groups to which this administrator belongs, and from these groups, all the parent companies connected by way of an ALLOWED_INHERIT relationship. The MATCH then uses a variable-length path ([:CHILD_OF*0..3]) to discover children of these parent companies, and thereafter the employees and accounts associated with all matched companies (whether parent company or child). At this point, the query has matched all companies, employees, and accounts accessible by way of ALLOWED_INHERIT relationships.
  • WHERE eliminates matches whose company, or parent companies, are connected by way of a DENIED relationship to the administrator’s groups. This WHERE clause is invoked for each match. If there is a DENIED relationship anywhere between the admin node and the company node bound by the match, that match is eliminated.
  • RETURN creates a projection of the matched data in the form of a list of employee names and accounts.

The second query here, following the UNION operator, is a little simpler:

  • The first MATCH selects the logged-in administrator from the nodes labeled Admin, and binds the result to the admin identifier.
  • The second MATCH simply matches companies (plus employees and accounts) that are directly connected to an administrator’s groups by way of an ALLOWED_DO_NOT_INHERIT relationship.

UNION运算符将这两个查询的结果连接在一起,消除任何重复项。请注意,RETURN每个查询中的子句必须包含相同的结果投影。换句话说,两个结果集中的列名必须匹配。

The UNION operator joins the results of these two queries together, eliminating any duplicates. Note that the RETURN clause in each query must contain the same projection of the results. In other words, the column names in the two result sets must match.

Figure 5-9 shows how this query matches all accessible resources for Sarah in the sample TeleGraph graph. Note that, because of the DENIED relationship from Group 2 to Skunkworkz, Sarah cannot administer Kate and Account 7.


Note

Cypher supports both UNION and UNION ALL operators. UNION eliminates duplicate results from the final result set, whereas UNION ALL includes any duplicates.


Figure 5-9. Finding all accessible resources for a user

Determining whether an administrator has access to a resource

The query we’ve just looked at returned a list of employees and accounts an administrator can manage. In a web application, each of these resources (employee, account) is accessible through its own URI. Given a friendly URI (e.g., http://TeleGraph/accounts/5436), what’s to stop someone from hacking a URI and gaining illegal access to an account?

What’s needed is a query that will determine whether an administrator has access to a specific resource. This is that query:

MATCH (admin:Admin {name:{adminName}}),
      (company:Company)-[:WORKS_FOR|HAS_ACCOUNT*1..2]
        -(resource:Resource {name:{resourceName}})
MATCH p=(admin)-[:MEMBER_OF]->(:Group)-[:ALLOWED_INHERIT]->(:Company)
        <-[:CHILD_OF*0..3]-(company)
WHERE NOT ((admin)-[:MEMBER_OF]->(:Group)-[:DENIED]->(:Company)
            <-[:CHILD_OF*0..3]-(company))
RETURN count(p) AS accessCount
UNION
MATCH (admin:Admin {name:{adminName}}),
      (company:Company)-[:WORKS_FOR|HAS_ACCOUNT*1..2]
        -(resource:Resource {name:{resourceName}})
MATCH p=(admin)-[:MEMBER_OF]->()-[:ALLOWED_DO_NOT_INHERIT]->(company)
RETURN count(p) AS accessCount

This query works by determining whether an administrator has access to the company to which an employee or an account belongs. Given an employee or account, we need to determine the company with which this resource is associated, and then work out whether the administrator has access to that company.

How do we identify the company to which an employee or account belongs? By labeling both as Resource (as well as either Employee or Account). An employee is connected to a company resource by a WORKS_FOR relationship. An account is associated with a company by way of an employee. HAS_ACCOUNT connects the employee to the account. WORKS_FOR then connects this employee to the company. In other words, an employee is one hop away from a company, whereas an account is two hops away from a company.
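
In application code, the same one-or-two-hop resolution could be sketched with two lookup tables (hypothetical function and sample data, for illustration only): an employee resolves to its company directly, while an account first resolves to the employee who holds it:

```python
def owning_company(resource, works_for, has_account):
    # One hop: the resource is an employee who works for a company.
    if resource in works_for:
        return works_for[resource]
    # Two hops: the resource is an account held by an employee.
    if resource in has_account:
        return works_for[has_account[resource]]
    return None

works_for = {"Kate": "Skunkworkz"}       # employee -> company
has_account = {"Account 7": "Kate"}      # account -> employee

print(owning_company("Account 7", works_for, has_account))  # Skunkworkz
```

This is exactly what the `[:WORKS_FOR|HAS_ACCOUNT*1..2]` path expression accomplishes inside the query, without the application needing to know in advance which kind of resource it was handed.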

With that bit of insight, we can see that this resource authorization check is similar to the query for finding all companies, employees, and accounts — only with several small differences:

  • The first MATCH finds the company to which an employee or account belongs. It uses Cypher’s OR operator, |, to match both WORKS_FOR and HAS_ACCOUNT relationships at depth one or two.
  • The WHERE clause in the query before the UNION operator eliminates matches where the company in question is connected to one of the administrator’s groups by way of a DENIED relationship.
  • The RETURN clauses for the queries before and after the UNION operator return a count of the number of matches. For an administrator to have access to a resource, one or both of these accessCount values must be greater than 0.

Because the UNION operator eliminates duplicate results, the overall result set for this query can contain either one or two values. The client-side algorithm for determining whether an administrator has access to a resource can be expressed easily in Java:

private boolean isAuthorized( Result result )
{
    Iterator<Long> accessCountIterator = result.columnAs( "accessCount" );
    while ( accessCountIterator.hasNext() )
    {
        if (accessCountIterator.next() > 0L)
        {
            return true;
        }
    }
    return false;
}

Finding administrators for an account

The previous two queries represent “top-down” views of the graph. The last TeleGraph query we’ll discuss here provides a “bottom-up” view of the data. Given a resource — an employee or account — who can manage it? Here’s the query:

MATCH (resource:Resource {name:{resourceName}})
MATCH p=(resource)-[:WORKS_FOR|HAS_ACCOUNT*1..2]-(company:Company)
        -[:CHILD_OF*0..3]->()<-[:ALLOWED_INHERIT]-()<-[:MEMBER_OF]-(admin:Admin)
WHERE NOT ((admin)-[:MEMBER_OF]->(:Group)-[:DENIED]->(:Company)
            <-[:CHILD_OF*0..3]-(company))
RETURN admin.name AS admin
UNION
MATCH (resource:Resource {name:{resourceName}})
MATCH p=(resource)-[:WORKS_FOR|HAS_ACCOUNT*1..2]-(company:Company)
        <-[:ALLOWED_DO_NOT_INHERIT]-(:Group)<-[:MEMBER_OF]-(admin:Admin)
RETURN admin.name AS admin

As before, the query consists of two independent queries joined by a UNION operator. Of particular note are the following clauses:

  • The first MATCH clause uses a Resource label, which allows it to identify both employees and accounts.
  • The second MATCH clause in each query contains a variable-length path expression that uses the | operator to specify a path that is one or two relationships deep, and whose relationship types comprise WORKS_FOR or HAS_ACCOUNT. This expression accommodates the fact that the subject of the query may be either an employee or an account.

Figure 5-10 shows the portions of the graph matched by the query when asked to find the administrators for Account 10.

Figure 5-10. Finding the administrators for a specific account

Geospatial and Logistics

Global Post is a global courier whose domestic operation delivers millions of parcels to more than 30 million addresses each day. In recent years, as a result of the rise in online shopping, the number of parcels has increased significantly. Amazon and eBay deliveries now account for more than half of the parcels routed and delivered by Global Post each day.

With parcel volumes continuing to grow, and facing strong competition from other courier services, Global Post has begun a large change program to upgrade all aspects of its parcel network, including buildings, equipment, systems, and processes.

One of the most important and time-critical components in the parcel network is the route calculation engine. Between one and three thousand parcels enter the network each second. As parcels enter the network they are mechanically sorted according to their destination. To maintain a steady flow during this process, the engine must calculate a parcel’s route before it reaches a point where the sorting equipment has to make a choice, which happens only seconds after the parcel has entered the network — hence the strict time requirements on the engine.

Not only must the engine route parcels in milliseconds, but it must do so according to the routes scheduled for a particular period. Parcel routes change throughout the year, with more trucks, delivery people, and collections over the Christmas period than during the summer, for example. The engine must, therefore, apply its calculations using only those routes that are available for a particular period.

On top of accommodating different routes and levels of parcel traffic, the new parcel network must also allow for significant change and evolution. The platform that Global Post develops today will form the business-critical basis of its operations for the next 10 years or more. During that time, the company anticipates large portions of the network — including equipment, premises, and transport routes — will change to match changes in the business environment. The data model underlying the route calculation engine must, therefore, allow for rapid and significant schema evolution.

Global Post data model

Figure 5-11 shows a simple example of the Global Post parcel network. The network comprises parcel centers, which are connected to delivery bases, each of which covers several delivery areas. These delivery areas, in turn, are subdivided into delivery segments covering many delivery units. There are around 25 national parcel centers and roughly 2 million delivery units (corresponding to postal or zip codes).

Figure 5-11. Elements in the Global Post network

Over time, the delivery routes change. Figures 5-12, 5-13, and 5-14 show three distinct delivery periods. For any given period, there is at most one route between a delivery base and any particular delivery area or segment. In contrast, there are multiple routes between delivery bases and parcel centers throughout the year. For any given point in time, therefore, the lower portions of the graph (the individual subgraphs below each delivery base) comprise simple tree structures, whereas the upper portions of the graph, made up of delivery bases and parcel centers, are more interconnected.

Figure 5-12. Structure of the delivery network for Period 1
Figure 5-13. Structure of the delivery network for Period 2
Figure 5-14. Structure of the delivery network for Period 3

Notice that delivery units are not included in the production data. This is because each delivery unit is always associated with the same delivery segments, irrespective of the period. Because of this invariant, it is possible to index each delivery segment by its many delivery units. To calculate the route to a particular delivery unit, the system need only actually calculate the route to its associated delivery segment, the name of which can be recovered from the index using the delivery unit as a key. This optimization helps both reduce the size of the production graph, and reduce the number of traversals needed to calculate a route.
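
A sketch of this optimization, with hypothetical unit and segment names: the unit-to-segment mapping is a plain key-value index, and only the segment-level route is ever computed against the graph:

```python
# Invariant index: each delivery unit (postal code) always maps to the
# same delivery segment, regardless of the delivery period.
unit_to_segment = {
    "NW1 8AB": "Segment 47",
    "NW1 8AC": "Segment 47",
    "SE10 9LS": "Segment 112",
}

def route_to_unit(delivery_unit, route_to_segment):
    # Resolve the invariant segment from the index, then delegate the
    # real, period-dependent route calculation to the segment-level
    # routine (here a caller-supplied stand-in).
    segment = unit_to_segment[delivery_unit]
    return route_to_segment(segment)

print(route_to_unit("NW1 8AB", lambda seg: ["Base 3", seg]))
```

Because roughly 2 million delivery units collapse onto a far smaller set of segments, the index shrinks the graph and shortens every route calculation by one traversal step.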

The production database contains the details of all the different delivery periods. As shown in Figure 5-15, the presence of so many period-specific relationships makes for a densely connected graph.

In the production data, nodes are connected by multiple relationships, each of which is timestamped with a start_date and end_date property. Relationships are of two types: CONNECTED_TO, which connects parcel centers and delivery bases, and DELIVERY_ROUTE, which connects delivery bases to delivery areas, and delivery areas to delivery segments. These two different types of relationships effectively partition the graph into its upper and lower parts, a strategy that provides for very efficient traversals. Figure 5-16 shows three of the timestamped CONNECTED_TO relationships connecting a parcel center to a delivery base.

Figure 5-15. Sample of the Global Post network

Route calculation

As described in the previous section, the CONNECTED_TO and DELIVERY_ROUTE relationships partition the graph into upper and lower parts, with the upper part made up of complexly connected parcel centers and delivery bases, and the lower part made up of delivery areas and delivery segments organized — for any given period — in simple tree structures rooted at the delivery bases.

Route calculations involve finding the cheapest route between two locations in the lower portions of the graph. The starting location is typically a delivery segment or delivery area, whereas the end location is always a delivery segment. A delivery segment, as we discussed earlier, is effectively a key for a delivery unit. Irrespective of the start and end locations, the calculated route must go via at least one parcel center in the upper part of the graph.

Figure 5-16. Timestamp properties on relationships

In terms of traversing the graph, a calculation can be split into three legs. Legs one and two, shown in Figure 5-17, work their way upward from the start and end locations, respectively, with each terminating at a delivery base. Because there is at most one route between any two elements in the lower portion of the graph for any given delivery period, traversing from one element to the next is simply a matter of finding an incoming DELIVERY_ROUTE relationship whose interval timestamps encompass the current delivery period. By following these relationships, the traversals for legs one and two navigate a pair of tree structures rooted at two different delivery bases. These two delivery bases then form the start and end locations for the third leg, which crosses the upper portion of the graph.

Figure 5-17. Shortest paths from the start and end locations to delivery bases

As with legs one and two, the traversal for leg three, as shown in Figure 5-18, looks for relationships — this time, CONNECTED_TO relationships — whose timestamps encompass the current delivery period. Even with this time filtering in place, however, there are, for any given period, potentially several routes between any two delivery bases in the upper portion of the graph. The third leg traversal must, therefore, sum the cost of each route, and select the cheapest, making this a shortest weighted path calculation.
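
The leg-three selection — filter candidate routes by delivery period, then take the one with the lowest total cost — can be sketched as follows. The data structures here are hypothetical stand-ins for the timestamped CONNECTED_TO relationships; the real engine performs this over the graph:

```python
def cheapest_route(routes, interval_start, interval_end):
    """Pick the cheapest route whose every relationship covers the
    requested delivery period.

    Each route is a list of relationships, each a dict carrying
    start_date, end_date, and cost properties."""
    valid = [
        r for r in routes
        if all(rel["start_date"] <= interval_start and
               rel["end_date"] >= interval_end
               for rel in r)
    ]
    # Shortest weighted path: minimize the summed cost.
    return min(valid, key=lambda r: sum(rel["cost"] for rel in r))

routes = [  # hypothetical candidate routes between two delivery bases
    [{"start_date": 1, "end_date": 100, "cost": 10},
     {"start_date": 1, "end_date": 100, "cost": 4}],
    [{"start_date": 1, "end_date": 100, "cost": 6}],
    [{"start_date": 50, "end_date": 60, "cost": 1}],  # wrong period
]
best = cheapest_route(routes, interval_start=10, interval_end=20)
```

Note that the cheapest route overall (cost 1) is rejected because its timestamps do not cover the requested period — period filtering happens before cost comparison, exactly as in the Cypher query's WHERE clauses.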

Figure 5-18. Shortest path between delivery bases

To complete the calculation, we need then simply to add the paths for legs one, three, and two, which gives the full path from the start to the end location.

Finding the shortest delivery route using Cypher

The Cypher query to implement the parcel route calculation engine is as follows:

MATCH (s:Location {name:{startLocation}}),
      (e:Location {name:{endLocation}})
MATCH upLeg = (s)<-[:DELIVERY_ROUTE*1..2]-(db1)
WHERE all(r in relationships(upLeg)
          WHERE r.start_date <= {intervalStart}
          AND r.end_date >= {intervalEnd})
WITH  e, upLeg, db1
MATCH downLeg = (db2)-[:DELIVERY_ROUTE*1..2]->(e)
WHERE all(r in relationships(downLeg)
          WHERE r.start_date <= {intervalStart}
          AND r.end_date >= {intervalEnd})
WITH  db1, db2, upLeg, downLeg
MATCH topRoute = (db1)<-[:CONNECTED_TO]-()-[:CONNECTED_TO*1..3]-(db2)
WHERE all(r in relationships(topRoute)
          WHERE r.start_date <= {intervalStart}
          AND r.end_date >= {intervalEnd})
WITH  upLeg, downLeg, topRoute,
      reduce(weight=0, r in relationships(topRoute) | weight+r.cost) AS score
      ORDER BY score ASC
      LIMIT 1
RETURN (nodes(upLeg) + tail(nodes(topRoute)) + tail(nodes(downLeg))) AS n

At first glance, this query appears quite complex. It is, however, made up of four simpler queries joined together with WITH clauses. We’ll look at each of these subqueries in turn.

Here’s the first subquery:

MATCH (s:Location {name:{startLocation}}),
      (e:Location {name:{endLocation}})
MATCH upLeg = (s)<-[:DELIVERY_ROUTE*1..2]-(db1)
WHERE all(r in relationships(upLeg)
          WHERE r.start_date <= {intervalStart}
          AND r.end_date >= {intervalEnd})

This query calculates the first leg of the overall route. It can be broken down as follows:

  • The first MATCH finds the start and end locations in the subset of nodes labeled Location, binding them to the s and e identifiers, respectively.
  • The second MATCH finds the route from the start location, s, to a delivery base using a directed, variable-length DELIVERY_ROUTE path. This path is then bound to the identifier upLeg. Because delivery bases are always the root nodes of DELIVERY_ROUTE trees, and therefore have no incoming DELIVERY_ROUTE relationships, we can be confident that the db1 node at the end of this variable-length path represents a delivery base and not some other parcel network element.
  • WHERE applies additional constraints to the path upLeg, ensuring that we only match DELIVERY_ROUTE relationships whose start_date and end_date properties encompass the supplied delivery period.
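The validity test applied by these WHERE clauses reduces to a simple predicate: a relationship whose lifetime is [start_date, end_date] can be used for a delivery period [intervalStart, intervalEnd] only if its lifetime fully encloses that period. A minimal sketch of that predicate (the class and method names are our own, and we assume epoch-millisecond longs, mirroring the timestamp properties in the query):

```java
public class DeliveryPeriodCheck
{
    // A relationship is valid for a delivery period only if its
    // [startDate, endDate] lifetime fully encloses that period.
    public static boolean encloses( long startDate, long endDate,
                                    long intervalStart, long intervalEnd )
    {
        return startDate <= intervalStart && endDate >= intervalEnd;
    }
}
```

Note that a relationship that merely overlaps the delivery period is rejected; the route must be valid for the whole period.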

The second subquery calculates the second leg of the route, which comprises the path from the end location up to the delivery base whose DELIVERY_ROUTE tree includes that end location as a leaf node. This query is very similar to the first:

WITH  e, upLeg, db1
MATCH downLeg = (db2)-[:DELIVERY_ROUTE*1..2]->(e)
WHERE all(r in relationships(downLeg)
          WHERE r.start_date <= {intervalStart}
          AND r.end_date >= {intervalEnd})

The WITH clause here chains the first subquery to the second, piping the end location and the first leg’s path and delivery base to the second subquery. The second subquery uses only the end location, e, in its MATCH clause; the rest is provided so that it can be piped to subsequent queries.

The third subquery identifies all candidate paths for the third leg of the route — that is, the route between delivery bases db1 and db2 — as follows:

WITH  db1, db2, upLeg, downLeg
MATCH topRoute = (db1)<-[:CONNECTED_TO]-()-[:CONNECTED_TO*1..3]-(db2)
WHERE all(r in relationships(topRoute)
          WHERE r.start_date <= {intervalStart}
          AND r.end_date >= {intervalEnd})

This subquery is broken down as follows:

  • WITH chains this subquery to the previous one, piping delivery bases db1 and db2 together with the paths identified in legs one and two to the current query.
  • MATCH identifies all paths between the first and second leg delivery bases, to a maximum depth of four, and binds them to the topRoute identifier.
  • WHERE constrains the topRoute paths to those whose start_date and end_date properties encompass the supplied delivery period.

The fourth and final subquery selects the shortest path for leg three, and then calculates the overall route:

WITH  upLeg, downLeg, topRoute,
      reduce(weight=0, r in relationships(topRoute) | weight+r.cost) AS score
      ORDER BY score ASC
      LIMIT 1
RETURN (nodes(upLeg) + tail(nodes(topRoute)) + tail(nodes(downLeg))) AS n

This subquery works as follows:

  • WITH pipes one or more triples, comprising upLeg, downLeg, and topRoute paths, to the current query. There will be one triple for each of the paths matched by the third subquery, with each path being bound to topRoute in successive triples (the paths bound to upLeg and downLeg will stay the same, because the first and second subqueries matched only one path each). Each triple is accompanied by a score for the path bound to topRoute for that triple. This score is calculated using Cypher’s reduce function, which for each triple sums the cost properties on the relationships in the path currently bound to topRoute. The triples are then ordered by this score, lowest first, and then limited to the first triple in the sorted list.
  • RETURN sums the nodes in the paths upLeg, topRoute, and downLeg to produce the final results. The tail function drops the first node in each of the paths topRoute and downLeg, because that node will already be present in the preceding path.
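The concatenation performed by the RETURN clause can be mimicked outside Cypher: append each subsequent leg minus its first node, because that node is the junction already present at the end of the previous leg. A sketch, with plain lists of node names standing in for paths (class and method names are ours):

```java
import java.util.ArrayList;
import java.util.List;

public class PathJoin
{
    // Joins legs into one route, dropping each later leg's first node --
    // the junction node already present at the end of the previous leg.
    // subList(1, ...) plays the role of Cypher's tail() function.
    public static List<String> join( List<List<String>> legs )
    {
        List<String> route = new ArrayList<>( legs.get( 0 ) );
        for ( int i = 1; i < legs.size(); i++ )
        {
            List<String> leg = legs.get( i );
            route.addAll( leg.subList( 1, leg.size() ) );
        }
        return route;
    }
}
```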

Implementing route calculation with the Traversal Framework

The time-critical nature of the route calculation imposes strict demands on the route calculation engine. As long as the individual query latencies are low enough, it’s always possible to scale horizontally for increased throughput. The Cypher-based solution is fast, but with many thousands of parcels entering the network each second, every millisecond impacts the cluster footprint. For this reason, Global Post adopted an alternative approach: calculating routes using Neo4j’s Traversal Framework.

A traversal-based implementation of the route calculation engine must solve two problems: finding the shortest paths, and filtering the paths based on time period. We’ll look at how we filter paths based on time period first.

Traversals should only follow relationships that are valid for the specified delivery period. In other words, as the traversal progresses through the graph, it should be presented with only those relationships whose periods of validity, as defined by their start_date and end_date properties, contain the specified delivery period.

We implement this relationship filtering using a PathExpander. Given a path from a traversal’s start node to the node where it is currently positioned, a PathExpander’s expand() method returns the relationships that can be used to traverse further. This method is called by the Traversal Framework each time the framework advances another node into the graph. If needed, the client can supply some initial state, called the branch state, to the traversal. The expand() method can use (and even change) the supplied branch state in the course of deciding which relationships to return. The route calculator’s ValidPathExpander implementation uses this branch state to supply the delivery period to the expander.

Here’s the code for the ValidPathExpander:

private static class ValidPathExpander implements PathExpander<Interval>
{
  private final RelationshipType relationshipType;
  private final Direction direction;

  private ValidPathExpander( RelationshipType relationshipType,
                             Direction direction )
  {
      this.relationshipType = relationshipType;
      this.direction = direction;
  }

  @Override
  public Iterable<Relationship> expand( Path path,
                                        BranchState<Interval> deliveryInterval )
  {
      List<Relationship> results = new ArrayList<Relationship>();
      for ( Relationship r : path.endNode()
                             .getRelationships( relationshipType, direction ) )
      {
          Interval relationshipInterval = new Interval(
              (Long) r.getProperty( "start_date" ),
              (Long) r.getProperty( "end_date" ) );
          if ( relationshipInterval.contains( deliveryInterval.getState() ) )
          {
              results.add( r );
          }
      }

      return results;
  }
}

The ValidPathExpander’s constructor takes two arguments: a relationshipType and a direction. This allows the expander to be reused for different types of relationships. In the case of the Global Post graph, the expander will be used to filter both CONNECTED_TO and DELIVERY_ROUTE relationships.

The expander’s expand() method takes as parameters the path that extends from the traversal’s start node to the node on which the traversal is currently positioned, and the deliveryInterval branch state as supplied by the client. Each time it is called, expand() iterates the relevant relationships on the current node (the current node is given by path.endNode()). For each relationship, the method then compares the relationship’s interval with the supplied delivery interval. If the relationship’s interval contains the delivery interval, the relationship is added to the results.
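The Interval type used by the expander behaves like a half-specified date-range value object (in a Joda-Time-style API). As a rough stand-in for the contains() check that expand() relies on, not the actual class the book's code imports, it might look like this:

```java
public class Interval
{
    private final long startMillis;
    private final long endMillis;

    public Interval( long startMillis, long endMillis )
    {
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    // True if the other interval lies entirely within this one --
    // the test expand() applies to each relationship's lifetime
    // against the delivery interval held in the branch state.
    public boolean contains( Interval other )
    {
        return startMillis <= other.startMillis
            && endMillis >= other.endMillis;
    }
}
```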

Having looked at the ValidPathExpander, we can now turn to the ParcelRouteCalculator itself. This class encapsulates all the logic necessary to calculate a route between the point where a parcel enters the network and the final delivery destination. It employs a similar strategy to the Cypher query we’ve already looked at. That is, it works its way up the graph from both the start node and the end node in two separate traversals, until it finds a delivery base for each leg. It then performs a shortest weighted path search that joins these two delivery bases.

Here’s the beginning of the ParcelRouteCalculator class:

public class ParcelRouteCalculator
{
    private static final PathExpander<Interval> DELIVERY_ROUTE_EXPANDER =
        new ValidPathExpander( withName( "DELIVERY_ROUTE" ),
                               Direction.INCOMING );

    private static final PathExpander<Interval> CONNECTED_TO_EXPANDER =
        new ValidPathExpander( withName( "CONNECTED_TO" ),
                               Direction.BOTH );

    private static final TraversalDescription DELIVERY_BASE_FINDER =
        Traversal.description()
            .depthFirst()
            .evaluator( new Evaluator()
            {
                private final RelationshipType DELIVERY_ROUTE =
                    withName( "DELIVERY_ROUTE");

                @Override
                public Evaluation evaluate( Path path )
                {
                    if ( isDeliveryBase( path ) )
                    {
                        return Evaluation.INCLUDE_AND_PRUNE;
                    }

                    return Evaluation.EXCLUDE_AND_CONTINUE;
                }

                private boolean isDeliveryBase( Path path )
                {
                    return !path.endNode().hasRelationship(
                        DELIVERY_ROUTE, Direction.INCOMING );
                }
            } );

    private static final CostEvaluator<Double> COST_EVALUATOR =
        CommonEvaluators.doubleCostEvaluator( "cost" );
    public static final Label LOCATION = DynamicLabel.label("Location");
    private GraphDatabaseService db;

    public ParcelRouteCalculator( GraphDatabaseService db )
    {
        this.db = db;
    }
    ...
}

Here we define two expanders — one for DELIVERY_ROUTE relationships, another for CONNECTED_TO relationships — and the traversal that will find the two legs of our route. This traversal terminates whenever it encounters a node with no incoming DELIVERY_ROUTE relationships. Because each delivery base is at the root of a delivery route tree, we can infer that a node without any incoming DELIVERY_ROUTE relationships represents a delivery base in our graph.

Each route calculation engine maintains a single instance of this route calculator. This instance is capable of servicing multiple requests. For each route to be calculated, the client calls the calculator’s calculateRoute() method, passing in the names of the start and end points, and the interval for which the route is to be calculated:

public Iterable<Node> calculateRoute( String start,
                                      String end,
                                      Interval interval )
{
    try ( Transaction tx = db.beginTx() )
    {
        TraversalDescription deliveryBaseFinder =
            createDeliveryBaseFinder( interval );

        Path upLeg = findRouteToDeliveryBase( start, deliveryBaseFinder );
        Path downLeg = findRouteToDeliveryBase( end, deliveryBaseFinder );

        Path topRoute = findRouteBetweenDeliveryBases(
            upLeg.endNode(),
            downLeg.endNode(),
            interval );

        Set<Node> routes = combineRoutes(upLeg, downLeg, topRoute);
        tx.success();
        return routes;
    }
}

calculateRoute() first obtains a deliveryBaseFinder for the specified interval, which it then uses to find the routes for the two legs. Next, it finds the route between the delivery bases at the top of each leg, these being the last nodes in each leg’s path. Finally, it combines these routes to generate the final results.

The createDeliveryBaseFinder() helper method creates a traversal description configured with the supplied interval:

private TraversalDescription createDeliveryBaseFinder( Interval interval )
{
    return DELIVERY_BASE_FINDER.expand( DELIVERY_ROUTE_EXPANDER,
        new InitialBranchState.State<>( interval, interval ) );
}

This traversal description is built by expanding the ParcelRouteCalculator’s static DELIVERY_BASE_FINDER traversal description using the DELIVERY_ROUTE_EXPANDER. The branch state for the expander is initialized at this point with the interval supplied by the client. This enables us to use the same base traversal description instance (DELIVERY_BASE_FINDER) for multiple requests. This base description is expanded and parameterized for each request.

Properly configured with an interval, the traversal description is then supplied to findRouteToDeliveryBase(), which looks up a starting node in the location index, and then executes the traversal:

private Path findRouteToDeliveryBase( String startPosition,
                                      TraversalDescription deliveryBaseFinder )
{
    Node startNode = IteratorUtil.single(
        db.findNodesByLabelAndProperty( LOCATION, "name", startPosition ) );
    return deliveryBaseFinder.traverse( startNode ).iterator().next();
}

That’s the two legs taken care of. The last part of the calculation requires us to find the shortest path between the delivery bases at the top of each of the legs. calculateRoute() takes the last node from each leg’s path, and supplies these two nodes together with the client-supplied interval to findRouteBetweenDeliveryBases(). Here’s the implementation of findRouteBetweenDeliveryBases():

private Path findRouteBetweenDeliveryBases( Node deliveryBase1,
                                            Node deliveryBase2,
                                            Interval interval )
{
    PathFinder<WeightedPath> routeBetweenDeliveryBasesFinder =
        GraphAlgoFactory.dijkstra(
            CONNECTED_TO_EXPANDER,
            new InitialBranchState.State<>( interval, interval ),
            COST_EVALUATOR );

    return routeBetweenDeliveryBasesFinder
        .findSinglePath( deliveryBase1, deliveryBase2 );
}

Rather than use a traversal description to find the shortest route between two nodes, this method uses a shortest weighted path algorithm from Neo4j’s graph algorithm library — in this instance, we’re using the Dijkstra algorithm (see “Path-Finding with Dijkstra’s Algorithm” for more details on the Dijkstra algorithm). This algorithm is configured with ParcelRouteCalculator’s static CONNECTED_TO_EXPANDER, which in turn is initialized with the client-supplied branch state interval. The algorithm is also configured with a cost evaluator (another static member), which simply identifies the property on a relationship representing that relationship’s weight or cost. A call to findSinglePath on the Dijkstra path finder returns the shortest path between the two delivery bases.
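As a reminder of what the library is doing under the covers, here is a compact, self-contained sketch of Dijkstra's algorithm over a weighted adjacency map. The node names and edge weights are purely illustrative; Neo4j's implementation reads its weights through the cost evaluator and restricts expansion through the path expander shown above.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class Dijkstra
{
    // Returns the cheapest cost from start to goal, or -1.0 if unreachable.
    // graph maps each node to a map of neighbor -> edge cost.
    public static double cheapestCost( Map<String, Map<String, Double>> graph,
                                       String start, String goal )
    {
        Map<String, Double> settled = new HashMap<>();
        // Queue entries pair a node with the cost of reaching it so far.
        PriorityQueue<Map.Entry<String, Double>> queue =
            new PriorityQueue<>( Map.Entry.comparingByValue() );
        queue.add( new AbstractMap.SimpleEntry<>( start, 0.0 ) );
        while ( !queue.isEmpty() )
        {
            Map.Entry<String, Double> current = queue.poll();
            String node = current.getKey();
            double cost = current.getValue();
            if ( settled.containsKey( node ) )
            {
                continue; // already settled at an equal or cheaper cost
            }
            settled.put( node, cost );
            if ( node.equals( goal ) )
            {
                return cost;
            }
            for ( Map.Entry<String, Double> edge :
                  graph.getOrDefault( node, Collections.emptyMap() ).entrySet() )
            {
                if ( !settled.containsKey( edge.getKey() ) )
                {
                    queue.add( new AbstractMap.SimpleEntry<>(
                        edge.getKey(), cost + edge.getValue() ) );
                }
            }
        }
        return -1.0;
    }
}
```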

That’s the hard work done. All that remains is to join these routes to form the final results. This is relatively straightforward, the only wrinkle being that the down leg’s path must be reversed before being added to the results (the leg was calculated from the final destination upward, whereas it should appear in the results from the delivery base downward):

private Set<Node> combineRoutes( Path upLeg,
                                 Path downLeg,
                                 Path topRoute )
{
    LinkedHashSet<Node> results = new LinkedHashSet<>();
    results.addAll( IteratorUtil.asCollection( upLeg.nodes() ));
    results.addAll( IteratorUtil.asCollection( topRoute.nodes() ));
    results.addAll( IteratorUtil.asCollection( downLeg.reverseNodes() ));
    return results;
}

Summary

In this chapter, we’ve looked at some common real-world use cases for graph databases, and described in detail three case studies that show how graph databases have been used to build a social network, implement access control, and manage complex logistics calculations.

In the next chapter, we dive deeper into the internals of a graph database. In the concluding chapter, we look at some analytical techniques and algorithms for processing graph data.

1 See Nicholas Christakis and James Fowler, Connected: The Amazing Power of Social Networks and How They Shape Our Lives (HarperPress, 2011).

2 Neo4j Spatial is an open source library of utilities that implement spatial indexes and expose Neo4j data to geotools.

Chapter 6. Graph Database Internals

In this chapter, we peek under the hood and discuss the implementation of graph databases, showing how they differ from other means of storing and querying complex, variably-structured, densely connected data. Although it is true that no single universal architecture pattern exists, even among graph databases, this chapter describes the most common architecture patterns and components you can expect to find inside a graph database.

We illustrate the discussion in this chapter using the Neo4j graph database, for several reasons. Neo4j is a graph database with native processing capabilities as well as native graph storage (see Chapter 1 for a discussion of native graph processing and storage). In addition to being the most common graph database in use at the time of writing, it has the transparency advantage of being open source, making it easy for the adventuresome reader to go a level deeper and inspect the code. Finally, it is a database the authors know well.

Native Graph Processing

We’ve discussed the property graph model several times throughout this book. By now you should be familiar with its notion of nodes connected by way of named and directed relationships, with both the nodes and relationships serving as containers for properties. Although the model itself is reasonably consistent across graph database implementations, there are numerous ways to encode and represent the graph in the database engine’s main memory. Of the many different engine architectures, we say that a graph database has native processing capabilities if it exhibits a property called index-free adjacency.

A database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. Each node, therefore, acts as a micro-index of other nearby nodes, which is much cheaper than using global indexes. It means that query times are independent of the total size of the graph, and are instead simply proportional to the amount of the graph searched.

A nonnative graph database engine, in contrast, uses (global) indexes to link nodes together, as shown in Figure 6-1. These indexes add a layer of indirection to each traversal, thereby incurring greater computational cost. Proponents for native graph processing argue that index-free adjacency is crucial for fast, efficient graph traversals.


Note

To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.


Figure 6-1. Nonnative graph processing engines use indexing to traverse between nodes

Figure 6-1 shows how a nonnative approach to graph processing works. To find Alice’s friends we have first to perform an index lookup, at cost O(log n). This may be acceptable for occasional or shallow lookups, but it quickly becomes expensive when we reverse the direction of the traversal. If, instead of finding Alice’s friends, we wanted to find out who is friends with Alice, we would have to perform multiple index lookups, one for each node that is potentially friends with Alice. This makes the cost far more onerous. Whereas it’s O(log n) cost to find out who are Alice’s friends, it’s O(m log n) to find out who is friends with Alice.

Index lookups can work for small networks, such as the one in Figure 6-1, but are far too costly for queries over larger graphs. Instead of using index lookups to fulfill the role of relationships at query time, graph databases with native graph processing capabilities use index-free adjacency to ensure high-performance traversals. Figure 6-2 shows how relationships eliminate the need for index lookups.

Figure 6-2. Neo4j uses relationships, not indexes, for fast traversals

Recall that in a general-purpose graph database, relationships can be traversed in either direction (tail to head, or head to tail) extremely cheaply. As we see in Figure 6-2, to find Alice’s friends using a graph, we simply follow her outgoing FRIEND relationships, at O(1) cost each. To find who is friends with Alice, we simply follow all of Alice’s incoming FRIEND relationships to their source, again at O(1) cost each.

Given these costs, it’s clear that, in theory at least, graph traversals can be very efficient. But such high-performance traversals only become reality when they are supported by an architecture designed for that purpose.

Native Graph Storage

If index-free adjacency is the key to high-performance traversals, queries, and writes, then one key aspect of the design of a graph database is the way in which graphs are stored. An efficient, native graph storage format supports extremely rapid traversals for arbitrary graph algorithms — an important reason for using graphs. For illustrative purposes we’ll use the Neo4j database as an example of how a graph database is architected.

First, let’s contextualize our discussion by looking at Neo4j’s high-level architecture, presented in Figure 6-3. In what follows we’ll work bottom-up, from the files on disk, through the programmatic APIs, and up to the Cypher query language. Along the way we’ll discuss the performance and dependability characteristics of Neo4j, and the design decisions that make Neo4j a performant, reliable graph database.

Figure 6-3. Neo4j architecture

Neo4j stores graph data in a number of different store files. Each store file contains the data for a specific part of the graph (e.g., there are separate stores for nodes, relationships, labels, and properties). The division of storage responsibilities — particularly the separation of graph structure from property data — facilitates performant graph traversals, even though it means the user’s view of their graph and the actual records on disk are structurally dissimilar. Let’s start our exploration of physical storage by looking at the structure of nodes and relationships on disk as shown in Figure 6-4.1

Figure 6-4. Neo4j node and relationship store file record structure

The node store file stores node records. Every node created in the user-level graph ends up in the node store, the physical file for which is neostore.nodestore.db. Like most of the Neo4j store files, the node store is a fixed-size record store, where each record is fifteen bytes in length. Fixed-size records enable fast lookups for nodes in the store file. If we have a node with id 100, then we know its record begins 1,500 bytes into the file. Based on this format, the database can directly compute a record’s location, at cost O(1), rather than performing a search, which would cost O(log n).
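The offset arithmetic is trivial to sketch. The record size below is just an example parameter, not a parsed Neo4j store format:

```java
// Fixed-size records make record lookup a multiplication, not a search.
public class RecordOffset {
    // byte position of a record in a store file with fixed-length records
    static long offset(long recordId, int recordSize) {
        return recordId * recordSize;
    }

    public static void main(String[] args) {
        // e.g., for a store whose records are 15 bytes long,
        // record 100 starts at byte 1500
        System.out.println(offset(100, 15));
    }
}
```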

The first byte of a node record is the in-use flag. This tells the database whether the record is currently being used to store a node, or whether it can be reclaimed on behalf of a new node (Neo4j’s .id files keep track of unused records). The next four bytes represent the ID of the first relationship connected to the node, and the following four bytes represent the ID of the first property for the node. The five bytes for labels point to the label store for this node (labels can be inlined where there are relatively few of them). The final byte is reserved for flags. One such flag is used to identify densely connected nodes, and the rest of the space is reserved for future use. The node record is quite lightweight: it’s really just a handful of pointers to lists of relationships, labels, and properties.

Correspondingly, relationships are stored in the relationship store file, neostore.relationshipstore.db. Like the node store, the relationship store also consists of fixed-sized records. Each relationship record contains the IDs of the nodes at the start and end of the relationship, a pointer to the relationship type (which is stored in the relationship type store), pointers for the next and previous relationship records for each of the start and end nodes, and a flag indicating whether the current record is the first in what’s often called the relationship chain.


Note

The node and relationship stores are concerned only with the structure of the graph, not its property data. Both stores use fixed-sized records so that any individual record’s location within a store file can be rapidly computed given its ID. These are critical design decisions that underline Neo4j’s commitment to high-performance traversals.


In Figure 6-5, we see how the various store files interact on disk. Each of the two node records contains a pointer to that node’s first property and first relationship in a relationship chain. To read a node’s properties, we follow the singly linked list structure beginning with the pointer to the first property. To find a relationship for a node, we follow that node’s relationship pointer to its first relationship (the LIKES relationship in this example). From here, we then follow the doubly linked list of relationships for that particular node (that is, either the start node doubly linked list, or the end node doubly linked list) until we find the relationship we’re interested in. Having found the record for the relationship we want, we can read that relationship’s properties (if there are any) using the same singly linked list structure as is used for node properties, or we can examine the node records for the two nodes the relationship connects using its start node and end node IDs. These IDs, multiplied by the node record size, give the immediate offset of each node in the node store file.

Figure 6-5. How a graph is physically stored in Neo4j

With fixed-sized records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as we’d have to do if faking a graph in a nongraph native database):

  1. From a given node record, locate the first record in the relationship chain by computing its offset into the relationship store — that is, by multiplying its ID by the fixed relationship record size. This gets us directly to the right record in the relationship store.
  2. From the relationship record, look in the second node field to find the ID of the second node. Multiply that ID by the node record size to locate the correct node record in the store.
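The two steps above can be modeled in a few lines of Java. Parallel arrays stand in for the store files here; the field names are simplifications for illustration, not Neo4j code:

```java
// Parallel arrays stand in for the fixed-size node and relationship stores;
// a hop is nothing more than array indexing with record IDs.
public class StoreHop {
    // "node store": ID of each node's first relationship
    static int[] firstRel = { 0, 0 };
    // "relationship store": start and end node IDs, one slot per record
    static int[] startNode = { 0 };
    static int[] endNode   = { 1 };

    // From a node record, follow its first relationship to the node
    // at the other end (the relationship may point in either direction).
    static int hop(int nodeId) {
        int relId = firstRel[nodeId];   // step 1: offset into relationship store
        int other = endNode[relId];     // step 2: read the second node field
        if (other == nodeId) {
            other = startNode[relId];   // we were the end node; go to the start
        }
        return other;
    }

    public static void main(String[] args) {
        System.out.println(hop(0)); // 1
        System.out.println(hop(1)); // 0
    }
}
```

Note that the same relationship record serves both of its nodes, which is what makes traversal equally cheap in either direction.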

Should we wish to constrain the traversal to relationships with particular types, we’d add a lookup in the relationship type store. Again, this is a simple multiplication of ID by record size to find the offset for the appropriate record in the relationship type store. Similarly, if we choose to constrain by label, we reference the label store.

In addition to the node and relationship stores, which contain the graph structure, we have property store files, which persist the user’s data in key-value pairs. Recall that Neo4j, being a property graph database, allows properties — name-value pairs — to be attached to both nodes and relationships. The property stores, therefore, are referenced from both node and relationship records.

Records in the property store are physically stored in the neostore.propertystore.db file. As with the node and relationship stores, property records are of a fixed size. Each property record consists of four property blocks and the ID of the next property in the property chain (remember, properties are held as a singly linked list on disk as compared to the doubly linked list used in relationship chains). Each property occupies between one and four property blocks — a property record can, therefore, hold a maximum of four properties. A property record holds the property type (Neo4j allows any primitive JVM type, plus strings, plus arrays of the JVM primitive types), and a pointer to the property index file (neostore.propertystore.db.index), which is where the property name is stored. For each property’s value, the record contains either a pointer into a dynamic store record or an inlined value. The dynamic stores allow for storing large property values. There are two dynamic stores: a dynamic string store (neostore.propertystore.db.strings) and a dynamic array store (neostore.propertystore.db.arrays). Dynamic records comprise linked lists of fixed-sized records; a very large string, or large array, may, therefore, occupy more than one dynamic record.
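The inline-versus-dynamic decision can be sketched as follows. The inline limit and dynamic record size here are invented for illustration; Neo4j’s real encoding is bit-packed and considerably more involved:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: small property values live inside the fixed-size property record;
// large ones spill into a linked chain of fixed-size dynamic-store records.
public class PropertyValueStore {
    static final int INLINE_LIMIT = 8;        // bytes a record can inline (assumed)
    static final int DYNAMIC_RECORD_SIZE = 4; // payload bytes per dynamic record (assumed)
    static List<String> dynamicStore = new ArrayList<>();

    // Returns "inline:<value>" or "dynamic:<firstRecordId>"
    static String store(String value) {
        if (value.length() <= INLINE_LIMIT) {
            return "inline:" + value;
        }
        int first = dynamicStore.size();
        // split the value across a chain of fixed-size dynamic records
        for (int i = 0; i < value.length(); i += DYNAMIC_RECORD_SIZE) {
            dynamicStore.add(value.substring(i,
                Math.min(i + DYNAMIC_RECORD_SIZE, value.length())));
        }
        return "dynamic:" + first;
    }

    public static void main(String[] args) {
        System.out.println(store("Alice"));                      // inlined
        System.out.println(store("a much longer string value")); // spills to dynamic store
    }
}
```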

Having an efficient storage layout is only half the picture. Despite the store files having been optimized for rapid traversals, hardware considerations can still have a significant impact on performance. Memory capacity has increased significantly in recent years; nonetheless, very large graphs will still exceed our ability to hold them entirely in main memory. Spinning disks have millisecond seek times in the order of single digits, which, though fast by human standards, are ponderously slow in computing terms. Solid state disks (SSDs) are far better (because there’s no significant seek penalty waiting for platters to rotate), but the path between CPU and disk is still more latent than the path to L2 cache or main memory, which is where ideally we’d like to operate on our graph.

To mitigate the performance characteristics of mechanical/electronic mass storage devices, many graph databases use in-memory caching to provide probabilistic low-latency access to the graph. From Neo4j 2.2, an off-heap cache is used to deliver this performance boost.

As of Neo4j 2.2, Neo4j uses an LRU-K page cache. The page cache is an LRU-K page-affined cache, meaning the cache divides each store into discrete regions, and then holds a fixed number of regions per store file. Pages are evicted from the cache based on a least frequently used (LFU) cache policy, nuanced by page popularity. That is, unpopular pages will be evicted from the cache in preference to popular pages, even if the latter haven’t been touched recently. This policy ensures a statistically optimal use of caching resources.
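A drastically simplified eviction sketch follows: plain least-frequently-used, standing in for the LRU-K policy described above, with an 8-byte array standing in for a real page. Everything here is illustrative, not Neo4j's cache implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal LFU cache: each page gets a hit count, and the least popular
// page is evicted when the cache is full.
public class TinyPageCache {
    final int capacity;
    final Map<Long, byte[]> pages = new HashMap<>();
    final Map<Long, Integer> hits = new HashMap<>();

    TinyPageCache(int capacity) { this.capacity = capacity; }

    byte[] get(long pageId) {
        byte[] page = pages.get(pageId);
        if (page == null) {
            if (pages.size() >= capacity) evictLeastPopular();
            page = loadFromDisk(pageId);
            pages.put(pageId, page);
        }
        hits.merge(pageId, 1, Integer::sum); // record the page's popularity
        return page;
    }

    void evictLeastPopular() {
        long victim = pages.keySet().stream()
            .min((a, b) -> Integer.compare(hits.getOrDefault(a, 0),
                                           hits.getOrDefault(b, 0)))
            .orElseThrow(IllegalStateException::new);
        pages.remove(victim);
    }

    byte[] loadFromDisk(long pageId) { return new byte[8]; } // stand-in for real I/O

    public static void main(String[] args) {
        TinyPageCache cache = new TinyPageCache(2);
        cache.get(1); cache.get(1); // page 1 is popular
        cache.get(2);               // page 2 touched once
        cache.get(3);               // cache full: evicts page 2, not page 1
        System.out.println(cache.pages.containsKey(1L)); // true
        System.out.println(cache.pages.containsKey(2L)); // false
    }
}
```

The point of the popularity nuance is visible even in this toy: the frequently touched page survives eviction although it was loaded earlier than the victim.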

Programmatic APIs

Although the filesystem and caching infrastructures are fascinating in themselves, developers rarely interact with them directly. Instead, developers manipulate a graph database through a query language, which can be either imperative or declarative. The examples in this book use the Cypher query language, the declarative query language native to Neo4j, because it is an easy language to learn and use. Other APIs exist, however, and depending on what we are doing, we may need to prioritize different concerns. It’s important to understand the choice of APIs and their capabilities when embarking on a new project. If there is any one thing to take away from this section, it is the notion that these APIs can be thought of as a stack, as depicted in Figure 6-6: at the top we prize expressiveness and declarative programming; at the bottom we prize precision, imperative style, and (at the lowest layer) “bare metal” performance.

Figure 6-6. A logical view of the user-facing APIs in Neo4j

We discussed Cypher in some detail in Chapter 3. In the following sections we’ll step through the remaining APIs from the bottom to the top. This API tour is meant to be illustrative. Not all graph databases have the same number of layers, nor necessarily layers that behave and interact in precisely the same way. Each API has its advantages and disadvantages, which you should investigate so you can make an informed decision.

Kernel API

At the lowest level of the API stack are the kernel’s transaction event handlers. These allow user code to listen to transactions as they flow through the kernel, and thereafter to react (or not) based on the data content and lifecycle stage of the transaction.

Core API

Neo4j’s Core API is an imperative Java API that exposes the graph primitives of nodes, relationships, properties, and labels to the user. When used for reads, the API is lazily evaluated, meaning that relationships are only traversed as and when the calling code demands the next node. Data is retrieved from the graph as quickly as the API caller can consume it, with the caller having the option to terminate the traversal at any point. For writes, the Core API provides transaction management capabilities to ensure atomic, consistent, isolated, and durable persistence.

In the following code, we see a snippet of code borrowed from the Neo4j tutorial in which we try to find human companions from the Doctor Who universe:2

// Index lookup for the node representing the Doctor is omitted for brevity

Iterable<Relationship> relationships =
            doctor.getRelationships( Direction.INCOMING, COMPANION_OF );

for ( Relationship rel : relationships )
{
    Node companionNode = rel.getStartNode();
    if ( companionNode.hasRelationship( Direction.OUTGOING, IS_A ) )
    {
        Relationship singleRelationship = companionNode
                                            .getSingleRelationship( IS_A,
                                                     Direction.OUTGOING );
        Node endNode = singleRelationship.getEndNode();
        if ( endNode.equals( human ) )
        {
            // Found one!
        }
    }
}

This code is very imperative: we simply loop round the Doctor’s companions and check to see if any of the companion nodes have an IS_A relationship to the node representing the human species. If the companion node is connected to the human species node, we do something with it.

Because it is an imperative API, the Core API requires us to fine-tune it to the underlying graph structure. This can be very fast. At the same time, however, it means we end up baking knowledge of our specific domain structure into our code. Compared to the higher-level APIs (particularly Cypher) more code is needed to achieve an equivalent goal. Nonetheless, the affinity between the Core API and the underlying record store is plain to see — the structures used at the store and cache level are exposed relatively faithfully by the Core API to user code.

Traversal Framework

The Traversal Framework is a declarative Java API. It enables the user to specify a set of constraints that limit the parts of the graph the traversal is allowed to visit. We can specify which relationship types to follow, and in which direction (effectively specifying relationship filters); we can indicate whether we want the traversal to be performed breadth-first or depth-first; and we can specify a user-defined path evaluator that is triggered with each node encountered. At each step of the traversal, this evaluator determines how the traversal is to proceed next. The following code snippet shows the Traversal API in action:

Traversal.description()
  .relationships( DoctorWhoRelationships.PLAYED, Direction.INCOMING )
  .breadthFirst()
  .evaluator( new Evaluator()
  {
    public Evaluation evaluate( Path path )
    {
      if ( path.endNode().hasRelationship(
                            DoctorWhoRelationships.REGENERATED_TO ) )
      {
        return Evaluation.INCLUDE_AND_CONTINUE;
      }
      else
      {
        return Evaluation.EXCLUDE_AND_CONTINUE;
      }
    }
  } );

With this snippet it’s plain to see the predominantly declarative nature of the Traversal Framework. The relationships() method declares that only PLAYED relationships in the INCOMING direction may be traversed. Thereafter, we declare that the traversal should be executed in a breadthFirst() manner, meaning it will visit all nearest neighbors before going further outward.

The Traversal Framework is declarative with regard to navigating graph structure. For our implementation of the Evaluator, however, we drop down to the imperative Core API. That is, we use the Core API to determine, given the path to the current node, whether or not further hops through the graph are necessary (we can also use the Core API to modify the graph from inside an evaluator). Again, the native graph structures inside the database bubble close to the surface here, with the graph primitives of nodes, relationships, and properties taking center stage in the API.

This concludes our brief survey of graph programming APIs, using the native Neo4j APIs as an example. We’ve seen how these APIs reflect the structures used in the lower levels of the Neo4j stack, and how this alignment permits idiomatic and rapid graph traversals.

It’s not enough for a database to be fast, however; it must also be dependable. This brings us to a discussion of the nonfunctional characteristics of graph databases.

Nonfunctional Characteristics

At this point we’ve understood what it means to construct a native graph database, and have seen how some of these graph-native capabilities are implemented, using Neo4j as our example. But to be considered dependable, any data storage technology must provide some level of guarantee as to the durability and accessibility of the stored data.3

One common measure by which relational databases are traditionally evaluated is the number of transactions per second they can process. In the relational world, it is assumed that these transactions uphold the ACID properties (even in the presence of failures) such that data is consistent and recoverable. For nonstop processing and managing of large volumes, a relational database is expected to scale so that many instances are available to process queries and updates, with the loss of an individual instance not unduly affecting the running of the cluster as a whole.

At a high level at least, much the same applies to graph databases. They need to guarantee consistency, recover gracefully from crashes, and prevent data corruption. Further, they need to scale out to provide high availability, and scale up for performance. In the following sections we’ll explore what each of these requirements means for a graph database architecture. Once again, we’ll expand on certain points by delving into Neo4j’s architecture as a means of providing concrete examples. It should be pointed out that not all graph databases are fully ACID. It is important, therefore, to understand the specifics of the transaction model of your chosen database. Neo4j’s ACID transactionality shows the considerable levels of dependability that can be obtained from a graph database — levels we are accustomed to obtaining from enterprise-class relational database management systems.

Transactions

Transactions have been a bedrock of dependable computing systems for decades. Although many NOSQL stores are not transactional, in part because there’s an unvalidated assumption that transactional systems scale less well, transactions remain a fundamental abstraction for dependability in contemporary graph databases — including Neo4j. (There is some truth to the claim that transactions limit scalability, insofar as distributed two-phase commit can exhibit unavailability problems in pathological cases, but in general the effect is much less marked than is often assumed.)

Transactions in Neo4j are semantically identical to traditional database transactions. Writes occur within a transaction context, with write locks being taken for consistency purposes on any nodes and relationships involved in the transaction. On successful completion of the transaction, changes are flushed to disk for durability, and the write locks released. These actions maintain the atomicity guarantees of the transaction. Should the transaction fail for some reason, the writes are discarded and the write locks released, thereby maintaining the graph in its previous consistent state.
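The lock-commit-release cycle described above can be sketched with a toy in Java. This models the semantics only; Neo4j's real lock manager and write-ahead log are far more sophisticated, and all names here are invented:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Take write locks on every touched element, apply the writes on success,
// discard them on failure, and release the locks either way.
public class ToyTransaction {
    static final Map<String, String> store = new HashMap<>();
    static final Map<String, ReentrantLock> locks = new HashMap<>();

    static boolean commit(Map<String, String> writes, boolean fail) {
        List<String> keys = new ArrayList<>(writes.keySet());
        Collections.sort(keys); // lock in a fixed order to avoid deadlock
        for (String k : keys) {
            locks.computeIfAbsent(k, x -> new ReentrantLock()).lock();
        }
        try {
            if (fail) {
                return false;      // writes discarded; store left consistent
            }
            store.putAll(writes);  // "flush to disk" on success
            return true;
        } finally {
            for (String k : keys) {
                locks.get(k).unlock(); // release the write locks
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> writes = Map.of("node:1/name", "Alice");
        System.out.println(commit(writes, true));     // false: nothing applied
        System.out.println(store.isEmpty());          // true
        System.out.println(commit(writes, false));    // true
        System.out.println(store.get("node:1/name")); // Alice
    }
}
```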

Should two or more transactions attempt to change the same graph elements concurrently, Neo4j will detect a potential deadlock situation, and serialize the transactions. Writes within a single transactional context will not be visible to other transactions, thereby maintaining isolation.

Once a transaction has committed, the system is in a state where changes are guaranteed to be in the database even if a fault then causes a non-pathological failure. This, as we shall now see, confers substantial advantages for recoverability, and hence for ongoing provision of service.

Recoverability

Databases are no different from any other software system in that they are susceptible to bugs in their implementation, in the hardware they run on, and in that hardware’s power, cooling, and connectivity. Though diligent engineers try to minimize the possibility of failure in all of these, at some point it’s inevitable that a database will crash — though the mean time between failures should be very long indeed.

In a well-designed system, a database server crash, though annoying, ought not affect availability, though it may affect throughput. And when a failed server resumes operation, it must not serve corrupt data to its users, irrespective of the nature or timing of the crash.

When recovering from an unclean shutdown, perhaps caused by a fault or even an overzealous operator, Neo4j checks in the most recently active transaction log and replays any transactions it finds against the store. It’s possible that some of those transactions may have already been applied to the store, but because replaying is an idempotent action, the net result is the same: after recovery, the store will be consistent with all transactions successfully committed prior to the failure.
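Why replay is safe to repeat: if each logged transaction sets state (rather than, say, incrementing it), applying the log once or twice yields the same store. A toy log of key-value puts makes the point; real transaction logs are binary and far richer than this:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Replaying a log of "set" operations is idempotent: one pass or two
// passes over the log produce exactly the same store.
public class IdempotentReplay {
    static Map<String, String> replay(List<Map<String, String>> log, int times) {
        Map<String, String> store = new HashMap<>();
        for (int t = 0; t < times; t++) {
            for (Map<String, String> tx : log) {
                store.putAll(tx); // re-applying a tx sets the same values again
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map<String, String>> log = List.of(
            Map.of("node:1/name", "Alice"),
            Map.of("node:1/name", "Alicia") // later tx overwrites the earlier value
        );
        // recovery may re-apply transactions already in the store: same result
        System.out.println(replay(log, 1).equals(replay(log, 2))); // true
    }
}
```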

Local recovery is all that is necessary in the case of a single database instance. Generally, however, we run databases in clusters (which we’ll discuss shortly) to assure high availability on behalf of client applications. Fortunately, clustering confers additional benefits to recovering instances. Not only will an instance become consistent with all transactions successfully committed prior to its failure, as discussed earlier, it can also quickly catch up with other instances in the cluster, and thereby be consistent with all transactions successfully committed subsequent to its failure. That is, once local recovery has completed, a replica can ask other members of the cluster — typically the master — for any newer transactions. It can then apply these newer transactions to its own dataset via transaction replay.

Recoverability deals with the capability of the database to set things right after a fault has arisen. In addition to recoverability, a good database needs to be highly available to meet the increasingly sophisticated needs of data-heavy applications.

Availability

In addition to being valuable in and of themselves, Neo4j’s transaction and recovery capabilities also benefit its high-availability characteristics. The database’s ability to recognize and, if necessary, repair an instance after crashing means that data quickly becomes available again without human intervention. And of course, more live instances increases the overall availability of the database to process queries.

在典型的生产场景中,通常不需要单独的断开连接的数据库实例。更常见的是,我们将数据库实例集群化以实现高可用性。Neo4j 使用主从集群安排来确保图形的完整副本存储在每台机器上。写入会以频繁的间隔从主服务器复制到从服务器。在任何时候,主服务器和一些从服务器都会拥有图形的完全最新的副本,而其他从服务器则会赶上(通常,它们只会落后几毫秒)。

It’s uncommon to want individual disconnected database instances in a typical production scenario. More often, we cluster database instances for high availability. Neo4j uses a master-slave cluster arrangement to ensure that a complete replica of the graph is stored on each machine. Writes are replicated out from the master to the slaves at frequent intervals. At any point, the master and some slaves will have a completely up-to-date copy of the graph, while other slaves will be catching up (typically, they will be but milliseconds behind).

写入走主服务器、读取走从服务器(write-master with read-slaves)是一种经典且流行的拓扑结构。在这种设置下,所有数据库写入都指向主服务器,读取操作指向从服务器。这为写入提供了渐近可扩展性(最高可达单个磁盘主轴的容量),而为读取提供了近乎线性的可扩展性(扣除管理集群的适度开销)。

For writes, the classic write-master with read-slaves is a popular topology. With this setup, all database writes are directed at the master, and read operations are directed at slaves. This provides asymptotic scalability for writes (up to the capacity of a single spindle) but allows for near linear scalability for reads (accounting for the modest overhead in managing the cluster).

尽管“写入主服务器,读取从服务器”是经典的部署拓扑,但 Neo4j 也支持通过从服务器写入。在这种情况下,客户端将写入指向的从服务器首先确保其与主服务器一致(“追赶”);此后,写入将在两个实例之间同步进行。当我们希望两个数据库实例立即具有持久性时,这非常有用。此外,由于它允许将写入指向任何实例,因此它提供了额外的部署灵活性。但是,由于强制追赶阶段,这会导致更高的写入延迟。这并不意味着写入分布在系统周围:所有写入仍必须在某个时刻通过主服务器。

Although write-master with read-slaves is a classic deployment topology, Neo4j also supports writing through slaves. In this scenario, the slave to which a write has been directed by the client first ensures that it is consistent with the master (it “catches up”); thereafter, the write is synchronously transacted across both instances. This is useful when we want immediate durability in two database instances. Furthermore, because it allows writes to be directed to any instance, it offers additional deployment flexibility. This comes at the cost of higher write latency, however, due to the forced catchup phase. It does not imply that writes are distributed around the system: all writes must still pass through the master at some point.

可用性的另一个方面是对资源访问权的争用。争用图中特定部分的独占访问权(例如写入)的操作,可能会遭受足够高的延迟,以至于看起来不可用。我们在 RDBMS 的粗粒度表级锁定中见过类似的争用:即使逻辑上不存在争用,写入也会出现延迟。

Another aspect of availability is contention for access to resources. An operation that contends for exclusive access (e.g., for writes) to a particular part of the graph may suffer from sufficiently high latency as to appear unavailable. We’ve seen similar contention with coarse-grained table-level locking in RDBMSs, where writes are latent even when there’s logically no contention.

幸运的是,在图中,访问模式往往分布得更均匀,尤其是在执行惯用的图本地查询时。图本地操作是从图中一个或多个给定位置开始,然后遍历周围子图的操作。此类查询的起点往往是域中特别重要的事物,例如用户或产品。这些起点导致整体查询负载以较低的争用率分布。反过来,客户端会感受到更高的响应能力和更高的可用性。

Fortunately, in a graph, access patterns tend to be more evenly spread, especially where idiomatic graph-local queries are executed. A graph-local operation is one that starts at one or more given places in the graph and then traverses the surrounding subgraphs. The starting points for such queries tend to be things that are especially significant in the domain, such as users or products. These starting points result in the overall query load being distributed with low contention. In turn, clients perceive greater responsiveness and higher availability.

我们对可用性的最后一点观察是:为集群范围的复制进行扩展不仅对容错有积极影响,对响应能力也是如此。由于有许多机器可用于给定的工作负载,查询延迟很低,可用性也得以保持。但正如我们接下来要讨论的,规模本身比我们部署的服务器数量要微妙得多。

Our final observation on availability is that scaling for cluster-wide replication has a positive impact, not just in terms of fault-tolerance, but also responsiveness. Because there are many machines available for a given workload, query latency is low and availability is maintained. But as we’ll now discuss, scale itself is more nuanced than simply the number of servers we deploy.

规模

Scale

随着数据量的增长,规模问题变得越来越重要。事实上,大规模数据问题(关系数据库很难解决)一直是 NOSQL 运动的重要动力。从某种意义上说,图形数据库也不例外;毕竟,它们也需要扩展以满足现代应用程序的工作负载需求。但规模并不是像每秒事务数这样的简单值。相反,它是我们跨多个轴测量的聚合值。

The topic of scale has become more important as data volumes have grown. In fact, the problems of data at scale, which have proven difficult to solve with relational databases, have been a substantial motivation for the NOSQL movement. In some sense, graph databases are no different; after all, they also need to scale to meet the workload demands of modern applications. But scale isn’t a simple value like transactions per second. Rather, it’s an aggregate value that we measure across multiple axes.

对于图形数据库,我们将把关于规模的广泛讨论分解为三个关键主题:

For graph databases, we will decompose our broad discussion on scale into three key themes:

  1. 容量(图形大小)
  2. Capacity (graph size)
  3. 延迟(响应时间)
  4. Latency (response time)
  5. 读写吞吐量
  6. Read and write throughput

容量

Capacity

一些图形数据库供应商选择放弃图大小的任何上限,以换取性能和存储成本。Neo4j 历来采取一种颇为独特的方法:通过针对覆盖 95% 用例的图大小进行优化,保持一个“最佳点”,从而实现更快的性能和更低的存储开销(进而减少内存占用和 IO 操作)。这种权衡的原因在于存储内部广泛使用固定大小的记录和指针(如“本机图存储”中所述)。在撰写本文时,Neo4j 的最新版本可以支持具有数百亿个节点、关系和属性的单个图。这足以容纳与 Facebook 规模大致相当的社交网络数据集。

Some graph database vendors have chosen to eschew any upper bounds in graph size in exchange for performance and storage cost. Neo4j has taken a somewhat unique approach historically, having maintained a “sweet spot” that achieves faster performance and lower storage (and consequently diminished memory footprint and IO-ops) by optimizing for graph sizes that lie at or below the 95th percentile of use cases. The reason for the trade-off lies in the use of fixed record sizes and pointers, which (as discussed in “Native Graph Storage”) it uses extensively inside of the store. At the time of writing, the current release of Neo4j can support single graphs having tens of billions of nodes, relationships, and properties. This allows for graphs with a social networking dataset roughly the size of Facebook’s.


笔记

Neo4j 团队已公开表达了在其路线图中支持单个图中 100B+ 节点/关系/属性的意图。

The Neo4j team has publicly expressed the intention to support 100B+ nodes/relationships/properties in a single graph as part of its roadmap.


数据集必须有多大才能充分利用图形数据库提供的所有优势?答案是:比您想象的要小。对于二度或三度查询,在只有几千个节点的数据集上就能体现性能优势。查询的度数越高,差距就越明显。易于开发的优势当然与数据量无关,无论数据库大小如何都能获得。作者见过的有意义的生产应用,规模小至几万个节点和几十万个关系,大至数十亿个节点和关系。

How large must a dataset be to take advantage of all of the benefits a graph database has to offer? The answer is, smaller than you might think. For queries of second or third degree, the performance benefits show with datasets having a few single-digit thousand nodes. The higher the degree of the query, the more extreme the delta. The ease-of-development benefits are of course unrelated to data volume, and available regardless of the database size. The authors have seen meaningful production applications range from as small as a few tens of thousands of nodes, and a few hundred thousand relationships, to billions of nodes and relationships.

延迟

Latency

图形数据库不会像传统关系数据库那样遭受延迟问题,在传统关系数据库中,表中的数据越多(以及索引中的数据越多),连接操作就越长(这个简单的事实是性能调优几乎总是关系数据库管理员最关心的问题的主要原因之一)。对于图形数据库,大多数查询都遵循一种模式,即索引仅用于查找起始节点(或节点)。然后,遍历的其余部分使用指针追踪和模式匹配的组合来搜索数据存储。这意味着,与关系数据库不同,性能不取决于数据集的总大小,而只取决于查询的数据。这导致性能时间几乎保持不变(即与结果集的大小有关),即使数据集的大小增加(尽管如我们在第 3 章中讨论的那样,即使我们处理的数据量较少,调整图形结构以适应查询仍然是明智的)。

Graph databases don’t suffer the same latency problems as traditional relational databases, where the more data we have in tables — and in indexes — the longer the join operations (this simple fact of life is one of the key reasons that performance tuning is nearly always the very top issue on a relational DBA’s mind). With a graph database, most queries follow a pattern whereby an index is used simply to find a starting node (or nodes). The remainder of the traversal then uses a combination of pointer chasing and pattern matching to search the data store. What this means is that, unlike relational databases, performance does not depend on the total size of the dataset, but only on the data being queried. This leads to performance times that are nearly constant (that is, are related to the size of the result set), even as the size of the dataset grows (though as we discussed in Chapter 3, it’s still sensible to tune the structure of the graph to suit the queries, even if we’re dealing with lower data volumes).

吞吐量

Throughput

我们可能认为图形数据库需要以与其他数据库相同的方式扩展。但事实并非如此。当我们查看 IO 密集型应用程序行为时,我们发现单个复杂业务操作通常会读取和写入一组相关数据。换句话说,应用程序对整个数据集内的逻辑子图执行多项操作。使用图形数据库,可以将这些多项操作汇总为更大、更具凝聚力的操作。此外,使用图形原生存储,执行每项操作所需的计算工作量比等效关系操作要少。图形通过以更少的工作量获得相同结果而实现扩展。

We might think a graph database would need to scale in the same way as other databases. But this isn’t the case. When we look at IO-intensive application behaviors, we see that a single complex business operation typically reads and writes a set of related data. In other words, the application performs multiple operations on a logical subgraph within the overall dataset. With a graph database such multiple operations can be rolled up into larger, more cohesive operations. Further, with a graph-native store, executing each operation takes less computational effort than the equivalent relational operation. Graphs scale by doing less work for the same outcome.

例如,想象一个出版场景,我们想阅读某位作者的最新作品。在 RDBMS 中,我们通常通过基于匹配的作者 ID 将作者表连接到出版物表来选择作者的作品,然后按出版日期对出版物进行排序并限制为最新的少数出版物。根据排序操作的特征,这可能是一个O(log(n))操作,这并不是很糟糕。

For example, imagine a publishing scenario in which we’d like to read the latest piece from an author. In an RDBMS we typically select the author’s works by joining the authors table to a table of publications based on matching author ID, and then ordering the publications by publication date and limiting to the newest handful. Depending on the characteristics of the ordering operation, that might be an O(log(n)) operation, which isn’t so very bad.

但是,如图 6-7所示,等效的图操作是O(1),这意味着无论数据集多大,性能都是恒定的。使用图,我们只需沿着从作者出发、名为WROTE的出站关系,到达已发布文章列表(或树)头部的作品即可。如果想查找较早的出版物,只需沿着PREV关系遍历链表(或者在树中递归)。写入也是如此,因为我们总是把新出版物插入列表的头部(或树的根),这同样是一个恒定时间操作。这与 RDBMS 方案相比更具优势,特别是因为它自然地保持了读取的恒定时间性能。

However, as shown in Figure 6-7, the equivalent graph operation is O(1), meaning constant performance irrespective of dataset size. With a graph we simply follow the outbound relationship called WROTE from the author to the work at the head of a list (or tree) of published articles. Should we wish to find older publications, we simply follow the PREV relationships and iterate through a linked list (or, alternatively, recurse through a tree). Writes follow suit because we always insert new publications at the head of the list (or root of a tree), which is another constant time operation. This compares favorably to the RDBMS alternative, particularly because it naturally maintains constant time performance for reads.
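A minimal in-memory model makes the constant-time claim concrete. This is a sketch, not Neo4j's storage layout: the `Author` and `Publication` classes and the sample titles are invented, but the `WROTE` head pointer and `PREV` links mirror the structure described above.

```python
# Toy model of the publishing example: the author holds a pointer to
# the head of a linked list of publications, so both "read the latest"
# and "insert a new work" are O(1) regardless of how many works exist.

class Publication:
    def __init__(self, title, prev=None):
        self.title = title
        self.prev = prev          # PREV relationship to the next-older work

class Author:
    def __init__(self, name):
        self.name = name
        self.wrote = None         # WROTE relationship to the newest work

    def publish(self, title):
        # The new work becomes the head of the list: constant time.
        self.wrote = Publication(title, prev=self.wrote)

    def latest(self):
        # A single pointer dereference: constant time.
        return self.wrote.title

author = Author("Ian")
for title in ["Graphs 101", "Graphs 102", "Graph Databases"]:
    author.publish(title)
```

Finding older works is then a walk along `prev` pointers, proportional only to how far back we look, never to the total dataset size.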

当然,最苛刻的部署将超出单台机器运行查询的能力,更具体地说是其 I/O 吞吐量。当这种情况发生时,使用 Neo4j 构建一个集群很简单,该集群可以水平扩展以实现高可用性和高读取吞吐量。对于典型的图形工作负载,读取量远远超过写入量,这种解决方案架构可能是理想的选择。

Of course, the most demanding deployments will overwhelm a single machine’s capacity to run queries, and more specifically its I/O throughput. When that happens, it’s straightforward to build a cluster with Neo4j that scales horizontally for high availability and high read throughput. For typical graph workloads, where reads far outstrip writes, this solution architecture can be ideal.

图 6-7。发布系统的恒定时间操作
Figure 6-7. Constant time operations for a publishing system

如果我们超出了集群的容量,可以通过在应用程序中构建分片逻辑,将图分布到多个数据库实例上。分片涉及使用合成标识符在应用程序级别跨数据库实例关联记录。其性能好坏在很大程度上取决于图的形状。有些图非常适合这样做。例如,Mozilla 在其下一代云浏览器 Pancake 中使用了 Neo4j 图形数据库。它存储的不是单个大图,而是大量与各个最终用户绑定的小型独立图。这使得它非常容易扩展。

Should we exceed the capacity of a cluster, we can spread a graph across database instances by building sharding logic into the application. Sharding involves the use of a synthetic identifier to join records across database instances at the application level. How well this will perform depends very much on the shape of the graph. Some graphs lend themselves very well to this. Mozilla, for instance, uses the Neo4j graph database as part of its next-generation cloud browser, Pancake. Rather than having a single large graph, it stores a large number of small independent graphs, each tied to an end user. This makes it very easy to scale.

当然,并非所有图都有如此方便的边界。如果我们的图大到需要拆分,却不存在自然边界,我们使用的方法与 MongoDB 之类的 NOSQL 存储大致相同:创建合成键,并在应用层使用这些键加上某种应用级解析算法来关联记录。与 MongoDB 方法的主要区别在于,在单个数据库实例内进行遍历时,本机图形数据库会带来性能提升,而跨实例运行的那部分遍历,其速度与 MongoDB 的连接大致相同。不过,整体性能应该明显更快。

Of course, not all graphs have such convenient boundaries. If our graph is large enough that it needs to be broken up, but no natural boundaries exist, the approach we use is much the same as what we would use with a NOSQL store like MongoDB: we create synthetic keys, and relate records via the application layer using those keys plus some application-level resolution algorithm. The main difference from the MongoDB approach is that a native graph database will provide you with a performance boost anytime you are doing traversals within a database instance, whereas those parts of the traversal that run between instances will run at roughly the same speed as a MongoDB join. Overall performance should be markedly faster, however.
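The application-level resolution step can be sketched as follows. The shard layout, node IDs, and `(shard_id, node_id)` key format here are invented for illustration; they stand in for whatever synthetic-key scheme the application chooses.

```python
# Sketch of application-level sharding with synthetic keys.
# Each shard holds part of the graph; a relationship that crosses a
# shard boundary stores a (shard_id, node_id) key that the application
# resolves itself, much like a client-side join.

shard_a = {"u1": {"name": "Rosa", "FRIEND": [("B", "u9")]}}
shard_b = {"u9": {"name": "Karl", "FRIEND": []}}
shards = {"A": shard_a, "B": shard_b}

def resolve(key):
    """Application-level 'join': look a synthetic key up in its shard."""
    shard_id, node_id = key
    return shards[shard_id][node_id]

# Traversing Rosa's cross-shard FRIEND relationship means resolving
# the synthetic key in the application layer.
friend = resolve(shard_a["u1"]["FRIEND"][0])
```

Traversals that stay inside `shard_a` remain native graph operations; only the cross-shard hop pays the resolution cost, which is the trade-off the paragraph above describes.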

概括

Summary

在本章中,我们展示了属性图如何成为实用数据建模的绝佳选择。我们探索了图形数据库的架构,特别是 Neo4j 的架构,并讨论了图形数据库实现的非功能性特征以及可靠性对它们意味着什么。

In this chapter we’ve shown how property graphs are an excellent choice for pragmatic data modeling. We’ve explored the architecture of a graph database, with particular reference to the architecture of Neo4j, and discussed the nonfunctional characteristics of graph database implementations and what it means for them to be dependable.

1来自 Neo4j 2.2 的记录布局;其他版本可能有不同的大小。

1 Record layout from Neo4j 2.2; other versions may have different sizes.

2 《神秘博士》是世界上播出时间最长的科幻电视剧,也是 Neo4j 团队的最爱。

2 Doctor Who is the world’s longest-running science fiction show and a firm favorite of the Neo4j team.

3根据http://www.dependability.org/,可靠性的正式定义是“计算系统的可信度,这使得人们可以合理地信赖它所提供的服务”。

3 The formal definition of dependability is the “trustworthiness of a computing system, which allows reliance to be justifiably placed on the service it delivers” as per http://www.dependability.org/.

4参见http://en.wikipedia.org/wiki/Cut_(graph_theory)

4 See http://en.wikipedia.org/wiki/Cut_(graph_theory)

第 7 章使用图论进行预测分析

Chapter 7. Predictive Analysis with Graph Theory

在本章中,我们将研究一些用于处理图形数据的分析技术和算法。图论和图算法都是成熟且易于理解的计算科学领域,我们将演示如何使用它们从图形数据库中挖掘复杂信息。虽然具有计算机科学背景的读者无疑会认识到这些算法和技术,但本章中的讨论没有借助数学,以鼓励好奇的外行深入研究。

In this chapter we’re going to examine some analytical techniques and algorithms for processing graph data. Both graph theory and graph algorithms are mature and well-understood fields of computing science and we’ll demonstrate how both can be used to mine sophisticated information from graph databases. Although the reader with a background in computing science will no doubt recognize these algorithms and techniques, the discussion in this chapter is handled without recourse to mathematics, to encourage the curious layperson to dive in.

深度和广度优先搜索

Depth- and Breadth-First Search

在研究高阶分析技术之前,我们需要重新熟悉基本的广度优先搜索算法,这是对整个图进行迭代的基础。我们在本书中看到的大多数查询本质上都是深度优先而不是广度优先。也就是说,它们从起始节点向外遍历到某个终止节点,然后从同一起始节点沿不同的路径重复类似的搜索。当我们试图沿着一条路径发现离散的信息时,深度优先是一种很好的策略。

Before we look at higher-order analytical techniques we need to reacquaint ourselves with the fundamental breadth-first search algorithm, which is the basis for iterating over an entire graph. Most of the queries we’ve seen throughout this book have been depth-first rather than breadth-first in nature. That is, they traverse outward from a starting node to some end node before repeating a similar search down a different path from the same start node. Depth-first is a good strategy when we’re trying to follow a path to discover discrete pieces of information.

虽然我们使用深度优先搜索作为一般图遍历的基本策略,但许多有趣的算法以广度优先的方式遍历整个图。也就是说,它们一次探索图的一层,首先从起始节点访问深度为 1 的每个节点,然后访问深度为 2 的每个节点,然后访问深度为 3 的节点,依此类推,直到访问整个图。从标记为O(表示原点)的节点开始,然后一次向外一层地进行,这一过程很容易可视化,如图7-1所示。

Though we’ve used depth-first search as our underlying strategy for general graph traversals, many interesting algorithms traverse the entire graph in a breadth-first manner. That is, they explore the graph one layer at a time, first visiting each node at depth 1 from the start node, then each of those at depth 2, then depth 3, and so on, until the entire graph has been visited. This progression is easily visualized starting at the node labeled O (for origin) and progressing outward a layer at a time, as shown in Figure 7-1.

图 7-1。广度优先搜索的进展
Figure 7-1. The progression of a breadth-first search
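The layer-at-a-time progression described above can be sketched directly. This is a minimal Python illustration; the small sample graph is invented, with `O` as the origin node as in the figure.

```python
from collections import deque

# Breadth-first search: visit the graph one layer at a time,
# recording the depth at which each node is first reached.

def bfs_layers(graph, origin):
    depth = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in depth:            # not yet visited
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth

# A small undirected graph: O's direct neighbors sit at depth 1,
# their unvisited neighbors at depth 2, and so on.
graph = {
    "O": ["a", "b"],
    "a": ["O", "c"],
    "b": ["O", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
layers = bfs_layers(graph, "O")
```

The queue guarantees that every node at depth *n* is processed before any node at depth *n*+1, which is precisely the layered expansion shown in Figure 7-1.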

搜索的终止取决于正在执行的算法——大多数有用的算法都不是纯广度优先搜索,而是在一定程度上是知情的。广度优先搜索通常用于路径查找算法或需要系统地搜索整个图时(例如我们在第 3 章中讨论的图全局算法)。

Termination of the search depends on the algorithm being executed — most useful algorithms aren’t pure breadth-first search but are informed to some extent. Breadth-first search is often used in path-finding algorithms or when the entire graph needs to be systematically searched (for the likes of graph global algorithms we discussed in Chapter 3).

使用 Dijkstra 算法进行路径搜索

Path-Finding with Dijkstra’s Algorithm

广度优先搜索是许多经典图算法的基础,包括 Dijkstra 算法。Dijkstra(通常缩写)用于查找图中两个节点之间的最短路径。Dijkstra 算法已经成熟,于 1956 年发布,此后被计算机科学家广泛研究和优化。其行为如下:

Breadth-first search underpins numerous classical graph algorithms, including Dijkstra’s algorithm. Dijkstra (as it is often abbreviated) is used to find the shortest path between two nodes in a graph. Dijkstra’s algorithm is mature, having been published in 1956, and thereafter widely studied and optimized by computer scientists. It behaves as follows:

  1. 选择起始节点和终止节点,并将起始节点添加到已解决节点集(即,已知从起始节点出发的最短路径的节点集),其值为 0(根据定义,起始节点与自身的路径长度为 0)。
  1. Pick the start and end nodes, and add the start node to the set of solved nodes (that is, the set of nodes with known shortest path from the start node) with value 0 (the start node is by definition 0 path length away from itself).
  2. 从起始节点开始,以广度优先方式遍历到最近的邻居,并记录到每个邻居节点的路径长度。
  2. From the starting node, traverse breadth-first to the nearest neighbors and record the path length against each neighbor node.
  3. 选取到其中一个邻居的最短路径(平局时任意选择),并将该节点标记为已解决,因为我们现在知道了从起始节点到该邻居的最短路径。
  3. Take the shortest path to one of these neighbors (picking arbitrarily in the case of ties) and mark that node as solved, because we now know the shortest path from the start node to this neighbor.
  4. 从已解决的节点集出发,访问最近的邻居(注意广度优先的推进方式),并记录从起始节点到这些新邻居的路径长度。不要访问任何已解决的相邻节点,因为我们已经知道到它们的最短路径。
  4. From the set of solved nodes, visit the nearest neighbors (notice the breadth-first progression) and record the path lengths from the start node against these new neighbors. Don’t visit any neighboring nodes that have already been solved, because we know the shortest paths to them already.
  5. 重复步骤 3 和 4,直到目标节点被标记为已解决。
  5. Repeat steps 3 and 4 until the destination node has been marked solved.

Dijkstra 经常用于寻找现实世界的最短路径(例如,用于导航)。下面是一个例子。在图 7-2中,我们看到了澳大利亚的逻辑地图。我们的挑战是找到东海岸的悉尼(标记为SYD)和西海岸的珀斯(标记为PER,相隔一个大陆)之间的最短驾驶路线。其他主要城镇和城市都标有各自的机场代码;我们将在沿途发现许多这样的城镇和城市。

Dijkstra is often used to find real-world shortest paths (e.g., for navigation). Here’s an example. In Figure 7-2 we see a logical map of Australia. Our challenge is to discover the shortest driving route between Sydney on the east coast (marked SYD) and Perth, marked PER, which is a continent away, on the west coast. The other major towns and cities are marked with their respective airport codes; we’ll discover many of them along the way.

图 7-2。澳大利亚及其主干道网络的逻辑表示
Figure 7-2. A logical representation of Australia and its arterial road network

从图 7-3中代表悉尼的节点开始,我们知道到悉尼的最短路径是 0 小时,因为我们已经到了那里。根据 Dijkstra 算法,现在只要我们知道从悉尼到悉尼的最短路径,悉尼问题就解决了。因此,我们将代表悉尼的节点灰化,添加路径长度 (0),并加粗节点的边框 — 在本示例的其余部分中,我们将保持这一惯例。

Starting at the node representing Sydney in Figure 7-3, we know the shortest path to Sydney is 0 hours, because we’re already there. In terms of Dijkstra’s algorithm, Sydney is now solved insofar as we know the shortest path from Sydney to Sydney. Accordingly, we’ve grayed out the node representing Sydney, added the path length (0), and thickened the node’s border — a convention that we’ll maintain throughout the remainder of this example.

从悉尼往外移动一级,我们的候选城市是布里斯班,位于北边 9 小时车程;澳大利亚首都堪培拉,位于西边 4 小时车程;墨尔本,位于南边 12 小时车程。

Moving one level out from Sydney, our candidate cities are Brisbane, which lies to the north by 9 hours; Canberra, Australia’s capital city, which lies 4 hours to the west; and Melbourne, which is 12 hours to the south.

我们能找到的最短路径是从悉尼到堪培拉,需要 4 个小时,因此我们认为堪培拉已经得到解决,如图7-4所示。

The shortest path we can find is Sydney to Canberra, at 4 hours, and so we consider Canberra to be solved, as shown in Figure 7-4.

图 7-3。从悉尼到悉尼的最短路径毫无意外地是 0 小时
Figure 7-3. The shortest path from Sydney to Sydney is, unsurprisingly, 0 hours
图 7-4。堪培拉是距离悉尼最近的城市
Figure 7-4. Canberra is the closest city to Sydney

我们已经解决的节点中的下一个节点是墨尔本,从悉尼经堪培拉到墨尔本需要 10 小时,从悉尼直飞则需要 12 小时,正如我们已经看到的。我们还有爱丽丝泉,从堪培拉到墨尔本需要 15 小时,从悉尼到墨尔本需要 19 小时,或者从布里斯班到悉尼需要 9 小时。

The next nodes out from our solved nodes are Melbourne, at 10 hours from Sydney via Canberra, or 12 hours from Sydney directly, as we’ve already seen. We also have Alice Springs, which is 15 hours from Canberra and 19 hours from Sydney, or Brisbane, which is 9 hours direct from Sydney.

因此,我们探索从悉尼到布里斯班的最短路径,即 9 小时,并认为布里斯班在 9 小时内得到解决,如图7-5所示。

Accordingly, we explore the shortest path, which is 9 hours from Sydney to Brisbane, and consider Brisbane solved at 9 hours, as shown in Figure 7-5.

图 7-5。布里斯班是下一个最接近的城市
Figure 7-5. Brisbane is the next closest city

我们已解决的节点中的下一个相邻节点是墨尔本,从堪培拉出发需 10 小时,从悉尼沿另一条道路直达需 12 小时;凯恩斯,从悉尼经布里斯班出发需 31 小时;爱丽丝泉,经布里斯班出发需 40 小时,经堪培拉出发需 19 小时。

The next neighboring nodes from our solved ones are Melbourne, which is 10 hours via Canberra or 12 hours direct from Sydney along a different road; Cairns, which is 31 hours from Sydney via Brisbane; and Alice Springs, which is 40 hours via Brisbane or 19 hours via Canberra.

因此,我们选择最短路径,即从悉尼经堪培拉到墨尔本需要 10 小时。这比现有的 12 小时直达线路要短。现在我们认为墨尔本已经解决了,如图7-6所示。

Accordingly, we choose the shortest path, which is Melbourne, being 10 hours from Sydney via Canberra. This is shorter than the existing 12 hours direct link. We now consider Melbourne solved, as shown in Figure 7-6.

图 7-6。到达距离起始节点悉尼第三近的城市墨尔本
Figure 7-6. Reaching Melbourne, the third-closest city to the start node, Sydney

图 7-7中,我们已解决节点的下一层相邻节点是阿德莱德,距离悉尼 18 小时(经堪培拉和墨尔本);凯恩斯,距离悉尼 31 小时(经布里斯班);以及爱丽丝泉,距离悉尼经堪培拉 19 小时,经布里斯班 40 小时。我们选择阿德莱德,并认为它以 18 小时的代价得到解决。

In Figure 7-7, the next layer of neighboring nodes from our solved ones are Adelaide at 18 hours from Sydney (via Canberra and Melbourne); Cairns, at 31 hours from Sydney (via Brisbane); and Alice Springs, at 19 hours from Sydney via Canberra, or 40 hours via Brisbane. We choose Adelaide and consider it solved at a cost of 18 hours.


笔记

我们不考虑路径 墨尔本→悉尼 ,因为它的目的地是一个已解决的节点 —— 事实上,在这种情况下,它是起始节点悉尼

We don’t consider the path Melbourne→Sydney because its destination is a solved node — in fact, in this case, it’s the start node, Sydney.


我们已经解决的节点的下一层相邻节点是珀斯(我们的最终目的地),从悉尼经阿德莱德出发需 50 小时;爱丽丝泉,从悉尼经堪培拉出发需 19 小时,经阿德莱德出发需 33 小时;凯恩斯,从悉尼经布里斯班出发需 31 小时。

The next layer of neighboring nodes from our solved ones are Perth — our final destination — which is 50 hours from Sydney via Adelaide; Alice Springs, which is 19 hours from Sydney via Canberra or 33 hours via Adelaide; and Cairns, which is 31 hours from Sydney via Brisbane.

在这种情况下,我们选择爱丽丝泉,因为它拥有目前最短的路径,尽管从鸟瞰图来看,我们知道从阿德莱德到珀斯实际上最终会更短——只要问问任何路过的丛林人就知道了。我们的成本是 19 小时,如图7-8所示。

We choose Alice Springs in this case because it has the current shortest path, even though with a bird’s eye view we know that actually it’ll be shorter in the end to go from Adelaide to Perth — just ask any passing bushman. Our cost is 19 hours, as shown in Figure 7-8.

图 7-7。解决阿德莱德问题
Figure 7-7. Solving Adelaide
图 7-8。绕道穿过爱丽丝泉
Figure 7-8. Taking a detour through Alice Springs

在图 7-9中,我们已解决节点的下一层相邻节点是凯恩斯(经布里斯班 31 小时,或经爱丽丝泉 43 小时)、达尔文(经爱丽丝泉 34 小时)以及珀斯(经阿德莱德 50 小时)。因此,我们将选择经布里斯班前往凯恩斯的路线,并认为凯恩斯已解决:从悉尼出发的最短驾驶时间为 31 小时。

In Figure 7-9, the next layer of neighboring nodes from our solved ones are Cairns at 31 hours via Brisbane or 43 hours via Alice Springs, or Darwin at 34 hours via Alice Springs, or Perth via Adelaide at 50 hours. So we’ll take the route to Cairns via Brisbane and consider Cairns solved with a shortest driving time from Sydney at 31 hours.

图 7-9。回到东海岸的凯恩斯
Figure 7-9. Back to Cairns on the east coast

我们已解决节点的下一层相邻节点是达尔文(从爱丽丝泉出发 34 小时,或经凯恩斯 61 小时)以及珀斯(经阿德莱德 50 小时)。因此,我们选择从爱丽丝泉到达尔文的路径,代价为 34 小时,并认为达尔文已解决,如图7-10所示。

The next layer of neighboring nodes from our solved ones are Darwin at 34 hours from Alice Springs, 61 hours via Cairns, or Perth via Adelaide at 50 hours. Accordingly, we choose the path to Darwin from Alice Springs at a cost of 34 hours and consider Darwin solved, as we can see in Figure 7-10.

最后,剩下的唯一相邻节点就是珀斯本身,如图7-11所示。从阿德莱德到达珀斯需要花费 50 小时,从达尔文到达珀斯需要花费 82 小时。因此,我们选择经由阿德莱德的路线,并认为从悉尼到达珀斯的最短路径为 50 小时。

Finally, the only neighboring node left is Perth itself, as we can see in Figure 7-11. It is accessible via Adelaide at a cost of 50 hours or via Darwin at a cost of 82 hours. Accordingly, we choose the route via Adelaide and consider Perth from Sydney solved at a shortest path of 50 hours.

图 7-10。前往澳大利亚“最高端”的达尔文
Figure 7-10. Darwin, at Australia’s “Top End”
图 7-11。终于到达珀斯,距离悉尼仅 50 小时车程
Figure 7-11. Finally reaching Perth, a mere 50 driving hours from Sydney
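As a sanity check on the walkthrough, here is a minimal Dijkstra implementation. The edge weights (in hours) are reconstructed from the driving times quoted in the text; the airport-code node names follow the map, and roads are assumed bidirectional.

```python
import heapq

# Road network from the Sydney-to-Perth example, with hours derived
# from the cumulative times given in the walkthrough.
roads = {
    ("SYD", "BNE"): 9,  ("SYD", "CBR"): 4,  ("SYD", "MEL"): 12,
    ("CBR", "MEL"): 6,  ("CBR", "ASP"): 15, ("BNE", "CNS"): 22,
    ("BNE", "ASP"): 31, ("MEL", "ADL"): 8,  ("ADL", "ASP"): 15,
    ("ADL", "PER"): 32, ("ASP", "CNS"): 24, ("ASP", "DAR"): 15,
    ("CNS", "DAR"): 30, ("DAR", "PER"): 48,
}

def dijkstra(edges, start, goal):
    graph = {}
    for (a, b), w in edges.items():          # roads run both ways
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    solved = {}                              # node -> shortest known cost
    frontier = [(0, start)]                  # priority queue of (cost, node)
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node in solved:
            continue                         # already solved via a shorter path
        solved[node] = cost
        if node == goal:
            return cost
        for neighbor, w in graph[node]:
            if neighbor not in solved:
                heapq.heappush(frontier, (cost + w, neighbor))
    return None

hours = dijkstra(roads, "SYD", "PER")
```

Running this reproduces the walkthrough: Perth is solved at 50 hours, via Canberra, Melbourne, and Adelaide.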

Dijkstra 算法效果很好,但由于其探索是无向的,因此存在一些病态的图拓扑,可能会导致最坏情况下的性能问题。在这些情况下,我们探索的图比直观上必要的更多 — 在某些情况下,我们会探索整个图。由于每个可能的节点都是相对孤立地一次考虑的,因此算法必然会遵循直观上永远不会有助于最终最短路径的路径。

Dijkstra’s algorithm works well, but because its exploration is undirected, there are some pathological graph topologies that can cause worst-case performance problems. In these situations, we explore more of the graph than is intuitively necessary — in some cases, we explore the entire graph. Because each possible node is considered one at a time in relative isolation, the algorithm necessarily follows paths that intuitively will never contribute to the final shortest path.

尽管 Dijkstra 算法成功计算出了悉尼和珀斯之间的最短路径,但任何对地图有一定直觉的人都不会选择探索从阿德莱德向北的路线,因为感觉这条路更长。如果我们有某种启发式机制来指导我们,就像在最佳优先搜索中一样(例如,更喜欢向西而不是向东,更喜欢向南而不是向北),我们可能已经避免了本例中去布里斯班、凯恩斯、爱丽丝泉和达尔文的支线旅行。但最佳优先搜索是贪婪的,即使途中有障碍物(例如,土路),它也会尝试向目标节点移动。我们可以做得更好。

Despite Dijkstra’s algorithm having successfully computed the shortest path between Sydney and Perth, anyone with any intuition about map reading would likely not have chosen to explore the route northward from Adelaide because it feels longer. If we had some heuristic mechanism to guide us, as in a best-first search (e.g., prefer to head west over east, prefer south over north) we might have avoided the side-trips to Brisbane, Cairns, Alice Springs, and Darwin in this example. But best-first searches are greedy, and try to move toward the destination node even if there is an obstacle (e.g., a dirt track) in the way. We can do better.

A* 算法

The A* Algorithm

A*(读作“A-star”)算法改进了经典的 Dijkstra 算法。它基于这样的观察:有些搜索是知情的,而借助这些信息,我们可以更好地选择穿过图的路径。在我们的例子中,知情搜索不会为了从悉尼前往珀斯而先横穿整个大陆绕道达尔文。A* 与 Dijkstra 类似,因为它可能搜索图的大片区域;但它也类似于贪婪的最佳优先搜索,因为它使用启发式方法来指导搜索。A* 结合了 Dijkstra 算法(优先选择靠近当前起点的节点)和最佳优先搜索(优先选择靠近目的地的节点)的特点,从而为在图中查找最短路径提供了可证明最优的解。

The A* (pronounced “A-star”) algorithm improves on the classic Dijkstra algorithm. It is based on the observation that some searches are informed, and that by being informed we can make better choices over which paths to take through the graph. In our example, an informed search wouldn’t go from Sydney to Perth by traversing an entire continent to Darwin first. A* is like Dijkstra in that it can potentially search a large swathe of a graph, but it’s also like a greedy best-first search insofar as it uses a heuristic to guide it. A* combines aspects of Dijkstra’s algorithm, which prefers nodes close to the current starting point, and best-first search, which prefers nodes closer to the destination, to provide a provably optimal solution for finding shortest paths in a graph.

在 A* 中,我们将路径成本分为两部分:g(n) ,即从起点到某个节点n的路径成本;h(n),表示从节点n到目标节点的路径成本估算,由启发式算法(智能猜测)计算得出。A* 算法在迭代图形时平衡g(n)h(n),从而确保在每次迭代时选择总成本最低的节点f(n) = g(n) + h(n)

In A* we split the path cost into two parts: g(n), which is the cost of the path from the starting point to some node n; and h(n), which represents the estimated cost of the path from the node n to the destination node, as computed by a heuristic (an intelligent guess). The A* algorithm balances g(n) and h(n) as it iterates the graph, thereby ensuring that at each iteration it chooses the node with the lowest overall cost f(n) = g(n) + h(n).
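The f(n) = g(n) + h(n) balance can be sketched on a small grid. The grid, the walls, and the Manhattan-distance heuristic below are illustrative assumptions, not from the book; they simply show nodes being expanded in order of lowest f(n).

```python
import heapq

# A* on a 5x5 grid: g(n) is the cost so far, h(n) is an admissible
# Manhattan-distance heuristic, and nodes are expanded in order of
# f(n) = g(n) + h(n).

def a_star(walls, start, goal, size=5):
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0, start)]        # (f, g, node)
    best_g = {start: 0}
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if node == goal:
            return g                         # optimal cost found
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            nxt = (node[0] + dx, node[1] + dy)
            if nxt in walls or not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None

# A wall blocks the direct route; A* detours around it yet still
# returns the optimal path cost.
cost = a_star(walls={(2, 1), (2, 2), (2, 3)}, start=(0, 2), goal=(4, 2))
```

Because the Manhattan heuristic never overestimates the remaining distance, the first time the goal is popped from the queue its g(n) is guaranteed to be optimal.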

正如我们所见,广度优先算法特别适合路径查找。但它们还有其他用途。以广度优先搜索作为迭代图中所有元素的方法,我们现在可以考虑图论中一些有趣的高阶算法,它们能对连接数据的行为产生预测性洞察。

As we’ve seen, breadth-first algorithms are particularly good for path finding. But they have other uses as well. Using breadth-first search as our method for iterating over all elements of a graph, we can now consider a number of interesting higher-order algorithms from graph theory that yield predictive insight into the behavior of connected data.

图论与预测模型

Graph Theory and Predictive Modeling

图论是一个成熟且易于理解的研究领域,研究网络(或者从我们的角度来看,连接数据)的性质。图论学家开发的分析技术可以应用于一系列有趣的问题。现在我们了解了诸如广度优先搜索之类的低级遍历机制,就可以开始考虑高阶分析了。

Graph theory is a mature and well-understood field of study concerning the nature of networks (or from our point of view, connected data). The analytic techniques that have been developed by graph theoreticians can be brought to bear on a range of interesting problems. Now that we understand the low-level traversal mechanisms, such as breadth-first search, we can start to consider higher-order analyses.

图论技术广泛应用于各种问题。当我们第一次想深入了解一个新领域时,它们尤其有用——甚至了解可以从一个领域中提取什么样的见解。在这种情况下,我们可以直接应用一系列来自图论和社会科学的技术来获得洞察力。

Graph theory techniques are broadly applicable to a wide range of problems. They are especially useful when we first want to gain some insight into a new domain — or even understand what kind of insight it’s possible to extract from a domain. In such cases there are a range of techniques from graph theory and social sciences that we can straightforwardly apply to gain insight.

在接下来的几节中,我们将介绍社交图论中的一些关键概念。我们将根据社会学家 Mark Granovetter、David Easley 和 Jon Kleinberg 的著作,在社交领域的背景下介绍这些概念。1

In the next few sections we’ll introduce some of the key concepts in social graph theory. We’ll introduce these concepts in the context of a social domain based on the works of sociologists Mark Granovetter, and David Easley and Jon Kleinberg.1

三元闭包

Triadic Closures

三元闭包是社交图的一个常见属性,我们观察到,如果两个节点通过涉及第三个节点的路径连接,则这两个节点在未来某个时间点直接连接的可能性就会增加。这是一种常见的社交现象。如果我们碰巧与两个互不相识的人是朋友,那么这两个人在未来某个时间点成为直接朋友的可能性就会增加。我们与他们两人都是朋友这一事实本身就为彼此提供了直接成为朋友的手段和动机。也就是说,两人通过与我们一起闲逛而相遇的机会增加了,而且如果他们真的见面了,他们很有可能会根据对我们的相互信任和我们的友谊选择而相互信任。他们都是我们的朋友这一事实本身就表明,就彼此而言,他们可能在社交上相似。

A triadic closure is a common property of social graphs, where we observe that if two nodes are connected via a path involving a third node, there is an increased likelihood that the two nodes will become directly connected at some point in the future. This is a familiar social occurrence. If we happen to be friends with two people who don’t know one another, there’s an increased chance that those two people will become direct friends at some point in the future. The very fact that we are friends with both of them gives each the means and the motive to become friends directly. That is, there’s an increased chance the two will meet one another through hanging around with us, and a good chance that if they do meet, they’ll trust one another based on their mutual trust in us and our friendship choices. The very fact of their both being our friend is an indicator that with respect to each other they may be socially similar.

格兰诺维特在其分析中指出,如果子图中存在一个节点A与另外两个节点BC之间都有强关系,则该子图支持强三元闭包性质:此时 B 和 C 之间至少存在关系,也可能存在关系。这是一个大胆的断言,它并不总是适用于图中的所有子图。尽管如此,它仍然足够普遍,特别是在社交网络中,可以作为一个可信的预测指标。

From his analysis, Granovetter noted that a subgraph upholds the strong triadic closure property if it has a node A with strong relationships to two other nodes, B and C. B and C then have at least a weak, and potentially a strong, relationship between them. This is a bold assertion, and it won’t always hold for all subgraphs in a graph. Nonetheless, it is sufficiently commonplace, particularly in social networks, as to be a credible predictive indicator.

让我们看看强三元闭包性质如何在工作场所图中起到预测辅助作用。我们将从一个简单的组织层次结构开始,其中 Alice 管理 Bob 和 Charlie,但她的下属之间还没有任何联系,如图7-12所示。

Let’s see how the strong triadic closure property works as a predictive aid in a workplace graph. We’ll start with a simple organizational hierarchy in which Alice manages Bob and Charlie, but where there are not yet any connections between her subordinates, as shown in Figure 7-12.

Figure 7-12. Alice manages Bob and Charlie

This is a rather strange situation for the workplace. After all, it’s unlikely that Bob and Charlie will be total strangers to one another. As shown in Figure 7-13, whether they’re high-level executives and therefore peers under Alice’s executive management or whether they’re assembly-line workers and therefore close colleagues under Alice acting as foreman, even informally we might expect Bob and Charlie to be somehow connected.

Figure 7-13. Bob and Charlie work together under Alice

Because Bob and Charlie both work with Alice, there’s a strong possibility they’re going to end up working together, as we see in Figure 7-13. This is consistent with the strong triadic closure property, which suggests that either Bob is a peer of Charlie (we’ll call this a weak relationship) or that Bob works with Charlie (which we’ll term a strong relationship). Adding a third WORKS_WITH or PEER_OF relationship between Bob and Charlie closes the triangle — hence the term triadic closure.

The empirical evidence from many domains, including sociology, public health, psychology, anthropology, and even technology (e.g., Facebook, Twitter, LinkedIn), suggests that the tendency toward triadic closure is real and substantial. This is consistent with anecdotal evidence and sentiment. But simple geometry isn't all that's at work here: the quality of the relationships involved in a graph also has a significant bearing on the formation of stable triadic closures.

Structural Balance

If we recall Figure 7-12, it’s intuitive to see how Bob and Charlie can become coworkers (or peers) under Alice’s management. For example purposes, we’re going to make an assumption that the MANAGES relationship is somewhat negative (after all, people don’t like getting bossed around) whereas the PEER_OF and WORKS_WITH relationship are positive (because people generally like their peers and the folks they work with).

We know from our previous discussion on the strong triadic closure principle that in Figure 7-12 where Alice MANAGES Bob and Charlie, a triadic closure should be formed. That is, in the absence of any other constraints, we would expect at least a PEER_OF, a WORKS_WITH, or even a MANAGES relationship between Bob and Charlie.

A similar tendency toward creating a triadic closure exists if Alice MANAGES Bob who in turn WORKS_WITH Charlie, as we can see in Figure 7-14. Anecdotally this rings true: if Bob and Charlie work together it makes sense for them to share a manager, especially if the organization seemingly allows Charlie to function without managerial supervision.

Figure 7-14. Alice manages Bob, who works with Charlie

However, applying the strong triadic closure principle blindly can lead to some rather odd and uncomfortable-looking organization hierarchies. For instance, if Alice MANAGES Bob and Charlie but Bob also MANAGES Charlie, we have a recipe for discontent. Nobody would wish it upon Charlie that he’s managed both by his boss and his boss’s boss as in Figure 7-15.

Figure 7-15. Alice manages Bob and Charlie, while Bob also manages Charlie

Similarly, it’s uncomfortable for Bob if he is managed by Alice while working with Charlie who is also Alice’s workmate. This cuts awkwardly across organization layers as we see in Figure 7-16. It also means Bob could never safely let off steam about Alice’s management style amongst a supportive peer group.

Figure 7-16. Alice manages Bob, who works with Charlie, while also working with Charlie herself

The awkward hierarchy in Figure 7-16 whereby Charlie is both a peer of the boss and a peer of another worker is unlikely to be socially pleasant, so Charlie and Alice will agitate against it (either wanting to be a boss or a worker). It’s similar for Bob who doesn’t know for sure whether to treat Charlie in the same way he treats his manager Alice (because Charlie and Alice are peers) or as his own direct peer.

It’s clear that the triadic closures in Figures 7-15 and 7-16 are uncomfortable to us, eschewing our innate preference for structural symmetry and rational layering. This preference is given a name in graph theory: structural balance.

Anecdotally, there’s a much more acceptable, structurally balanced triadic closure if Alice MANAGES Bob and Charlie, but where Bob and Charlie are themselves workmates connected by a WORKS_WITH relationship, as we can see in Figure 7-17.

Figure 7-17. Workmates Bob and Charlie are managed by Alice

The same structural balance manifests itself in an equally acceptable triadic closure where Alice, Bob, and Charlie are all workmates. In this arrangement the workers are in it together, which can be a socially amicable arrangement that engenders camaraderie as in Figure 7-18.

Figure 7-18. Alice, Bob, and Charlie are all workmates

In Figures 7-17 and 7-18, the triadic closures are idiomatic and constructed with either three WORKS_WITH relationships or two MANAGES and a single WORKS_WITH relationship. They are all balanced triadic closures. To understand what it means to have balanced and unbalanced triadic closures, we’ll add more semantic richness to the model by declaring that the WORKS_WITH relationship is socially positive (because coworkers spend a lot of time interacting), whereas MANAGES is a negative relationship because managers spend overall less of their time interacting with individuals in their charge.

Given this new dimension of positive and negative sentiment, we can now ask the question "What is so special about these balanced structures?" It's clear that strong triadic closure is still at work, but that's not the only driver. In this case the notion of structural balance also has an effect. A structurally balanced triadic closure consists either of relationships that are all of positive sentiment (our WORKS_WITH or PEER_OF relationships) or of two relationships with negative sentiment (MANAGES in our case) and a single positive relationship.
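
This rule is easy to state in code. In the Python sketch below (the sentiment values follow the assignments in the text; the function name is ours), a triad is balanced when it contains an odd number of positive relationships, which is precisely "all three positive, or one positive and two negative":

```python
# Sentiments as assigned in the text: WORKS_WITH and PEER_OF are
# positive relationships, MANAGES is negative.
SENTIMENT = {"WORKS_WITH": +1, "PEER_OF": +1, "MANAGES": -1}

# A triadic closure is structurally balanced when the number of
# positive relationships is odd (1 or 3) -- equivalently, when the
# product of the three sentiment signs is positive.
def is_balanced(rel1, rel2, rel3):
    signs = [SENTIMENT[r] for r in (rel1, rel2, rel3)]
    return signs.count(+1) in (1, 3)

# Alice MANAGES Bob and Charlie, who work with each other: balanced.
print(is_balanced("MANAGES", "MANAGES", "WORKS_WITH"))     # True
# Two positive ties and one negative tie: unbalanced.
print(is_balanced("MANAGES", "WORKS_WITH", "WORKS_WITH"))  # False
```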

We see this often in the real world. If we have two good friends, then social pressure tends toward those good friends themselves becoming good friends. It’s unusual that those two friends themselves are enemies because that puts a strain on all our friendships. One friend cannot express his dislike of the other to us, because the other person is our friend too! Given those pressures, it’s one outcome that the group will resolve its differences and good friends will emerge. This would change our unbalanced triadic closure (two relationships with positive sentiments and one negative) to a balanced closure because all relationships would be of a positive sentiment much like our collaborative scheme where Alice, Bob, and Charlie all work together in Figure 7-18.

However, the plausible (though arguably less pleasant) outcome would be where we take sides in the dispute between our “friends,” creating two relationships with negative sentiments — effectively ganging up on an individual. Now we can engage in gossip about our mutual dislike of a former friend and the closure again becomes balanced. Equally we see this reflected in the organizational scenario where Alice, by managing Bob and Charlie, becomes, in effect, their workplace enemy as in Figure 7-17.

Balanced closures add another dimension to the predictive power of graphs. Simply by looking for opportunities to create balanced closures across a graph, even at very large scale, we can modify the graph structure for accurate predictive analyses. But we can go further, and in the next section we’ll bring in the notion of local bridges, which give us valuable insight into the communications flow of our organization, and from that knowledge comes the ability to adapt it to meet future challenges.

Local Bridges

An organization of only three people as we’ve been using is anomalous, and the graphs we’ve studied in this section are best thought of as small subgraphs as part of a larger organizational hierarchy. When we start to consider managing a larger organization we expect a much more complex graph structure, but we can also apply other heuristics to the structure to help make sense of the business. In fact, once we have introduced other parts of the organization into the graph, we can reason about global properties of the graph based on the locally acting strong triadic closure principle.

In Figure 7-19, we’re presented with a counterintuitive scenario where two groups in the organization are managed by Alice and Davina, respectively. However, we have the slightly awkward structure that Alice not only runs a team with Bob and Charlie, but also manages Davina. Though this isn’t beyond the realm of possibility (Alice may indeed have such responsibilities), it feels intuitively awkward from an organizational design perspective.

Figure 7-19. Alice has skewed line-management responsibility

From a graph theory perspective it’s also unlikely. Because Alice participates in two strong relationships, she MANAGES Charlie (and Bob) and MANAGES Davina, naturally we’d like to create a triadic closure by adding at least a PEER_OF relationship between Davina and Charlie (and Bob). But Alice is also involved in a local bridge to Davina — together they’re a sole communication path between groups in the organization. Having the relationship Alice MANAGES Davina means we’d in fact have to create the closure. These two properties — local bridge and strong triadic closure — are in opposition.

Yet if Alice and Davina are peers (a weak relationship), then the strong triadic closure principle isn’t activated because there’s only one strong relationship — the MANAGES relationship to Bob (or Charlie) — and the local bridge property is valid as we can see in Figure 7-20.
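
Local bridges can also be detected mechanically: a relationship is a local bridge precisely when its endpoints have no other neighbors in common, so there is no alternative two-hop path between the groups it connects. A minimal Python sketch follows (the org chart adds a hypothetical Edward on Davina's side, purely for illustration):

```python
# A relationship (a, b) is a local bridge when a and b share no other
# neighbors -- removing it would leave no short alternative path
# between the two groups it connects.
def is_local_bridge(adj, a, b):
    if b not in adj[a]:
        return False  # a and b aren't directly connected at all
    common = (adj[a] - {b}) & (adj[b] - {a})
    return len(common) == 0

# Alice's and Davina's teams touch only through the Alice-Davina tie.
org = {
    "Alice":   {"Bob", "Charlie", "Davina"},
    "Bob":     {"Alice", "Charlie"},
    "Charlie": {"Alice", "Bob"},
    "Davina":  {"Alice", "Edward"},
    "Edward":  {"Davina"},
}
print(is_local_bridge(org, "Alice", "Davina"))  # True
print(is_local_bridge(org, "Alice", "Bob"))     # False (Charlie in common)
```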

What’s interesting about this local bridge is that it describes a communication channel between groups in our organization. Such channels are extremely important to the vitality of our enterprise. In particular, to ensure the health of our company we’d make sure that local bridge relationships are healthy and active, or equally we might keep an eye on local bridges to ensure that no impropriety (embezzlement, fraud, etc.) occurs.

Figure 7-20. Alice and Davina are connected by a local bridge

This same property of local bridges being weak links (PEER_OF in our example organization) is a property that is prevalent throughout social graphs. This means we can start to make predictive analyses of how our organization will evolve based on empirically derived local bridge and strong triadic closure notions. So given an arbitrary organizational graph, we can see how the business structure is likely to evolve and plan for those eventualities.

Summary

Graphs are truly remarkable structures. Our understanding of them is rooted in hundreds of years of mathematical and scientific study. And yet we’re only just beginning to understand how to apply them to our personal, social, and business lives. The technology is here, open and available to all in the form of the modern graph database. The opportunities are endless.

As we’ve seen throughout this book, graph theory algorithms and analytical techniques are not demanding. We need only understand how to apply them to achieve our goals. We leave this book with a simple call to arms: embrace graphs and graph databases. Take all that you’ve learned about modeling with graphs, graph database architecture, designing and implementing a graph database solution, and applying graph algorithms to complex business problems, and go build the next truly pioneering information system.

1 In particular, see Granovetter’s pivotal work on the strength of weak ties in social communities: http://stanford.io/17XjisT. For Easley and Kleinberg, see http://bit.ly/13e0ZuZ.

Appendix A. NOSQL Overview

Recent years have seen a meteoric rise in the popularity of a family of data storage technologies known as NOSQL (a cheeky acronym for Not Only SQL, or more confrontationally, No to SQL). But NOSQL as a term defines what those data stores are not — they’re not SQL-centric relational databases — rather than what they are, which is an interesting and useful set of storage technologies whose operational, functional, and architectural characteristics are many and varied.

Why were these new databases created? What problems do they address? Here we’ll discuss some of the new data challenges that have emerged in the past decade. We’ll then look at four families of NOSQL databases, including graph databases.

The Rise of NOSQL

Historically, most enterprise-level web apps ran on top of a relational database. But in the past decade, we’ve been faced with data that is bigger in volume, changes more rapidly, and is more structurally varied than can be dealt with by traditional RDBMS deployments. The NOSQL movement has arisen in response to these challenges.

It’s no surprise that as storage has increased dramatically, volume has become the principal driver behind the adoption of NOSQL stores by organizations. Volume may be defined simply as the size of the stored data.

As is well known, large datasets become unwieldy when stored in relational databases. In particular, query execution times increase as the size of tables and the number of joins grow (so-called join pain). This isn’t the fault of the databases themselves. Rather, it is an aspect of the underlying data model, which builds a set of all possible answers to a query before filtering to arrive at the correct solution.

In an effort to avoid joins and join pain, and thereby cope better with extremely large datasets, the NOSQL world has adopted several alternatives to the relational model. Though more adept at dealing with very large datasets, these alternative models tend to be less expressive than the relational one (with the exception of the graph model, which is actually more expressive).

But volume isn’t the only problem modern web-facing systems have to deal with. Besides being big, today’s data often changes very rapidly. Velocity is the rate at which data changes over time.

Velocity is rarely a static metric. Internal and external changes to a system and the context in which it is employed can have considerable impact on velocity. Coupled with high volume, variable velocity requires data stores to not only handle sustained levels of high write loads, but also deal with peaks.

There is another aspect to velocity, which is the rate at which the structure of the data changes. In other words, in addition to the value of specific properties changing, the overall structure of the elements hosting those properties can change as well. This commonly occurs for two reasons. The first is fast-moving business dynamics. As the business changes, so do its data needs. The second is that data acquisition is often an experimental affair. Some properties are captured “just in case,” others are introduced at a later point based on changed needs. The ones that prove valuable to the business stay around, others fall by the wayside. Both these forms of velocity are problematic in the relational world, where high write loads translate into a high processing cost, and high schema volatility has a high operational cost.

Although commentators have later added other useful requirements to the original quest for scale, the final key aspect is the realization that data is far more varied than the data we’ve dealt with in the relational world. For existential proof, think of all those nulls in our tables and the null checks in our code. This has driven out the final widely agreed upon facet, variety, which we define as the degree to which data is regularly or irregularly structured, dense or sparse, connected or disconnected.

ACID versus BASE

When we first encounter NOSQL it’s often in the context of what many of us are already familiar with: relational databases. Although we know the data and query model will be different (after all, there’s no SQL), the consistency models used by NOSQL stores can also be quite different from those employed by relational databases. Many NOSQL databases use different consistency models to support the differences in volume, velocity, and variety of data discussed earlier.

Let’s explore what consistency features are available to help keep data safe, and what trade-offs are involved when using (most) NOSQL stores.1

In the relational database world, we’re all familiar with ACID transactions, which have been the norm for some time. The ACID guarantees provide us with a safe environment in which to operate on data:

Atomic
All operations in a transaction succeed or every operation is rolled back.
Consistent
On transaction completion, the database is structurally sound.
Isolated
Transactions do not contend with one another. Contentious access to state is moderated by the database so that transactions appear to run sequentially.
Durable
The results of applying a transaction are permanent, even in the presence of failures.

These properties mean that once a transaction completes, its data is consistent (so-called write consistency) and stable on disk (or disks, or indeed in multiple distinct memory locations). This is a wonderful abstraction for the application developer, but requires sophisticated locking, which can cause logical unavailability, and is typically considered to be a heavyweight pattern for most use cases.

For many domains, ACID transactions are far more pessimistic than the domain actually requires. In the NOSQL world, ACID transactions have gone out of fashion as stores loosen the requirements for immediate consistency, data freshness, and accuracy in order to gain other benefits, like scale and resilience. Instead of using ACID, the term BASE has arisen as a popular way of describing the properties of a more optimistic storage strategy:

Basic availability
The store appears to work most of the time.
Soft-state
Stores don't have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
Eventual consistency
Stores exhibit consistency at some later point (e.g., lazily at read time).

The BASE properties are considerably looser than the ACID guarantees, and there is no direct mapping between them. A BASE store values availability (because that is a core building block for scale), but does not offer guaranteed consistency of replicas at write time. BASE stores provide a less strict assurance: that data will be consistent in the future, perhaps at read time (e.g., Riak), or will always be consistent, but only for certain processed past snapshots (e.g., Datomic).

Given such loose support for consistency, we as developers need to be more knowledgable and rigorous when considering data consistency. We must be familiar with the BASE behavior of our chosen stores and work within those constraints. At the application level we must choose on a case-by-case basis whether we will accept potentially inconsistent data, or whether we will instruct the database to provide consistent data at read time, while incurring the latency penalty that that implies. (In order to guarantee consistent reads, the database will need to compare all replicas of a data element, and in an inconsistent outcome even perform remedial repair work on that data.) From a development perspective this is a far cry from the simplicity of relying on transactions to manage consistent state on our behalf, and though that’s not necessarily a bad thing, it does require effort.

The NOSQL Quadrants

Having discussed the BASE model that underpins consistency in NOSQL stores, we’re ready to start looking at the numerous user-level data models. To disambiguate these models, we’ve devised a simple taxonomy, as shown in Figure A-1. This taxonomy divides the contemporary NOSQL space into four quadrants. Stores in each quadrant address a different kind of functional use case — though nonfunctional requirements can also strongly influence our choice of database.

In the following sections we’ll deal with each of these quadrants, highlighting the characteristics of the data model, operational aspects, and drivers for adoption.

Document Stores

Document databases offer the most immediately familiar paradigm for developers used to working with hierarchically structured documents. Document databases store and retrieve documents, just like an electronic filing cabinet. Documents tend to comprise maps and lists, allowing for natural hierarchies — much as we’re used to with formats like JSON and XML.

Figure A-1. The NOSQL store quadrants

At the simplest level, documents can be stored and retrieved by ID. Providing an application remembers the IDs it’s interested in (e.g., usernames), a document store can act much like a key-value store (of which we’ll see more later). But in the general case, document stores rely on indexes to facilitate access to documents based on their attributes. For example, in an ecommerce scenario, we might use indexes to represent distinct product types so that they can be offered up to potential sellers, as shown in Figure A-2. In general, indexes are used to retrieve sets of related documents from the store for an application to use.
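
To make the role of indexes concrete, here is a toy Python sketch of a document store with a single secondary index on product type (the class and field names are ours, not any particular product's API). Documents remain retrievable by ID, while the index materializes the set of documents for each product type:

```python
# A toy document store: documents are retrieved by ID, and a
# secondary index maps a product type to the set of matching IDs.
class DocumentStore:
    def __init__(self):
        self.docs = {}        # doc ID -> document
        self.type_index = {}  # product type -> set of doc IDs

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        # Writes pay extra cost here: the index must be maintained too.
        self.type_index.setdefault(doc["type"], set()).add(doc_id)

    def get(self, doc_id):
        return self.docs[doc_id]

    def find_by_type(self, product_type):
        # Reads by type avoid scanning every document.
        ids = self.type_index.get(product_type, set())
        return [self.docs[i] for i in ids]

store = DocumentStore()
store.put("p1", {"type": "book", "title": "Graph Databases"})
store.put("p2", {"type": "book", "title": "Another Title"})
store.put("p3", {"type": "mug", "title": "Coffee Mug"})
print(len(store.find_by_type("book")))  # 2
```

Note that every put also updates the index: the write-for-read trade-off made explicit.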

Much like indexes in relational databases, indexes in a document store enable us to trade write performance for greater read performance. Writes are more costly, because they also maintain indexes, but reads require scanning fewer records to find pertinent data. For write-heavy records, it’s worth bearing in mind that indexes might actually degrade performance overall.

Where data hasn’t been indexed, queries are typically much slower, because a full search of the dataset has to happen. This is obviously an expensive task and is to be avoided wherever possible — and as we shall see, rather than process these queries internally, it’s normal for document database users to externalize this kind of processing in parallel compute frameworks.

Figure A-2. Indexes materialize sets of entities in a document store

Because the data model of a document store is one of disconnected entities, document stores tend to have interesting and useful operational characteristics. They should scale horizontally, due to there being no contended state between mutually independent records at write time, and no need to transact across replicas.

For writes, document databases have, historically, provided transactionality limited to the level of an individual record. That is, a document database will ensure that writes to a single document are atomic — assuming the administrator has opted for safe levels of persistence when setting up the database. Support for operating across sets of documents atomically is emerging in this category, but it is not yet mature. In the absence of multikey transactions, it is down to application developers to write compensating logic in application code.

Because stored documents are not connected (except through indexes), there are numerous optimistic concurrency control mechanisms that can be used to help reconcile concurrent contending writes for a single document without having to resort to strict locks. In fact, some document stores (like CouchDB) have made this a key point of their value proposition: documents can be held in a multimaster database that automatically replicates concurrently accessed, contended state across instances without undue interference from the user.

In other stores, the database management system may also be able to distinguish and reconcile writes to different parts of a document, or even use timestamps to reconcile several contended writes into a single logically consistent outcome. This is a reasonable optimistic trade-off insofar as it reduces some of the need for transactions by using alternative mechanisms that optimistically control storage while striving to provide lower latency and higher throughput.

Key-Value Stores

Key-value stores are cousins of the document store family, but their lineage comes from Amazon’s Dynamo database. They act like large, distributed hashmap data structures that store and retrieve opaque values by key.

As shown in Figure A-3, the key space of the hashmap is spread across numerous buckets on the network. For fault-tolerance reasons, each bucket is replicated onto several machines. The formula for the number of replicas required is given by R = 2F + 1, where F is the number of failures we can tolerate. The replication algorithm seeks to ensure that machines aren't exact copies of each other. This allows the system to load-balance while a machine and its buckets recover. It also helps avoid hotspots, which can cause inadvertent self denial-of-service.

From the client’s point of view, key-value stores are easy to use. A client stores a data element by hashing a domain-specific identifier (key). The hash function is crafted such that it provides a uniform distribution across the available buckets, thereby ensuring that no single machine becomes a hotspot. Given the hashed key, the client can use that address to store the value in a corresponding bucket. Clients use a similar process to retrieve stored values.
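The hashing scheme described above can be sketched in a few lines. This is a toy illustration under assumed names (`BUCKETS`, `bucket_for`, `replicas_required` are all invented for this example), not any real key-value store's API; it simply shows how a domain key deterministically resolves to a bucket, and how the R = 2F + 1 replica formula is applied.

```python
import hashlib

# Hypothetical bucket names spread across the network.
BUCKETS = ["bucket-0", "bucket-1", "bucket-2", "bucket-3"]

def bucket_for(key: str) -> str:
    """Hash a domain-specific key uniformly across the available buckets."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return BUCKETS[digest % len(BUCKETS)]

def replicas_required(f: int) -> int:
    """R = 2F + 1: replicas needed to tolerate F failures."""
    return 2 * f + 1

# The same key always resolves to the same bucket, so a client can store
# and later retrieve a value without any central lookup.
store = {b: {} for b in BUCKETS}
key = "alice@example.org"
store[bucket_for(key)][key] = {"name": "Alice"}
```

Because both store and retrieve paths compute the same hash, no coordination is needed between clients to agree on a key's location.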

grdb aa03
Figure A-3. A key-value store acts like a distributed hashmap data structure

Given such a model, applications wishing to store data in, or retrieve data from, a key-value store need only know (or compute) the corresponding key. Although there is a very large number of possible keys in the key set, in practice keys tend to fall out quite naturally from the application domain. Usernames and email addresses, Cartesian coordinates for places of interest, Social Security numbers, and zip codes are all natural keys for various domains. With a sensibly designed system, the chance of losing data in the store due to a missing key is low.

The key-value data model is similar to the document data model. What differentiates them is the level of insight each offers into its data.

In theory, key-value stores are oblivious to the information contained in their values. Pure key-value stores simply concern themselves with efficient storage and retrieval of opaque data on behalf of applications, unencumbered by its nature and application usage.

In practice, such distinctions aren’t always so clear-cut. Some of the popular key-value stores — Riak, for instance — also offer visibility into certain types of structured stored data like XML and JSON. Riak also supports some core data types (called CRDTs) that can be confidently merged even in the presence of concurrent writes. At a product level, then, there is some overlap between the document and key-value stores.

Although simple, the key-value model, much as the document model, offers little in the way of data insight to the application developer. To retrieve sets of useful information from across individual records, we typically use an external processing infrastructure, such as MapReduce. This is highly latent compared to executing queries in the data store.

Key-value stores offer certain operational and scale advantages. Descended as they are from Amazon’s Dynamo database — a platform designed for a nonstop shopping cart service — they tend to be optimized for high availability and scale. Or, as the Amazon team puts it, they should work even “if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.”

Column Family

Column family stores are modeled on Google’s BigTable. The data model is based on a sparsely populated table whose rows can contain arbitrary columns, the keys for which provide natural indexing.


Note

In our discussion we’ll use terminology from Apache Cassandra. Cassandra isn’t necessarily a faithful interpretation of BigTable, but it is widely deployed, and its terminology well understood.


图 A-4中,我们看到了列族数据库。最简单的存储单元是本身,由名称-值对组成。可以将任意数量的列组合成一个超级列,超级列为已排序的列集提供名称。列存储在行中,当一行仅包含列时,它被称为列。当一行包含超级列时,它被称为超级列族

In Figure A-4, we see the four common building blocks used in column family databases. The simplest unit of storage is the column itself, consisting of a name-value pair. Any number of columns can be combined into a super column, which gives a name to a sorted set of columns. Columns are stored in rows, and when a row contains columns only, it is known as a column family. When a row contains super columns, it is known as a super column family.

grdb aa04
Figure A-4. The four building blocks of a column family store

It might seem odd to focus on rows when the data model is ostensibly columnar, but individual rows are important, because they provide the nested hashmap structure into which we denormalize our data. In Figure A-5 we show how we might map a recording artist and his albums into a super column family structure — logically, it’s really nothing more than maps of maps.
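The "maps of maps" structure can be sketched with plain nested dictionaries. The row key, column family, and super column names below are illustrative, not Cassandra API calls; the point is only to show how the same structure supports both the row-oriented and the column-oriented views described in the text.

```python
# One row per overarching entity; column families group related columns,
# and a super column names a sorted set of columns (here, albums by year).
rows = {
    "Beatles": {                       # row key
        "info": {                      # column family of simple columns
            "name": "The Beatles",
            "nationality": "English",
        },
        "albums": {                    # super column
            "1966": "Revolver",
            "1969": "Abbey Road",
        },
    }
}

# Row-oriented access: everything about one entity in a single lookup.
beatles = rows["Beatles"]

# Column-oriented access: "line up" a column across all rows.
english_artists = [key for key, row in rows.items()
                   if row["info"]["nationality"] == "English"]
```

The row view answers "tell me about this artist"; the column view answers "which artists are English" across the whole dataset, mirroring the 90-degree rotation discussed below.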

grdb aa05
Figure A-5. Storing line-of-business data in a super column family

In a column family database, each row in the table represents a particular overarching entity (e.g., everything about an artist). These column families are containers for related pieces of data, such as the artist’s name and discography. Within the column families we find actual key-value data, such as album release dates and the artist’s date of birth.

Helpfully, this row-oriented view can be turned 90 degrees to arrive at a column-oriented view. Where each row gives a complete view of one entity, the column view naturally indexes particular aspects across the whole dataset. For example, as we see in Figure A-6, by “lining up” keys we are able to find all the rows where the artist is English. From there it’s easy to extract complete artist data from each row. It’s not connected data as we’d find in a graph, but it does at least provide some insight into a set of related entities.

Column family databases are distinguished from document and key-value stores not only by their more expressive data model, but also by their operational characteristics. Apache Cassandra, for example, which is based on a Dynamo-like infrastructure, is architected for distribution, scale, and failover. Under the covers it uses several storage engines that deal with high write loads — the kind of peak write loads generated by popular interactive TV shows.

grdb aa06
Figure A-6. Keys form a natural index through rows in a column family database

Overall, column family databases are reasonably expressive, and operationally very competent. And yet they’re still aggregate stores, just like document and key-value databases, and as such still lack joins. Querying them for insight into data at scale requires processing by some external application infrastructure.

Query versus Processing in Aggregate Stores

In the preceding sections we’ve highlighted the similarities and differences between the document, key-value, and column family data models. On balance, the similarities are greater than the differences. In fact, the similarities are so great, the three types are sometimes referred to jointly as aggregate stores. Aggregate stores persist standalone complex records that reflect the Domain-Driven Design notion of an aggregate.

Though each aggregate store has a different storage strategy, they all have a great deal in common when it comes to querying data. For simple ad hoc queries, each tends to provide features such as indexing, simple document linking, or a query language. For more complex queries, applications commonly identify and extract a subset of data from the store before piping it through some external processing infrastructure such as a MapReduce framework. This is done when the necessary deep insight cannot be generated simply by examining individual aggregates.

MapReduce, like BigTable, is another technique that comes to us from Google. The most prevalent open source implementation of MapReduce is Apache Hadoop and its attendant ecosystem.

MapReduce is a parallel programming model that splits data and operates on it in parallel before gathering it back together and aggregating it to provide focused information. If, for example, we wanted to use it to count how many American artists there are in a recording artists database, we’d extract all the artist records and discard the non-American ones in the map phase, and then count the remaining records in the reduce phase.
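The artist-counting example can be sketched as a tiny in-process map/reduce. The artist records are invented for illustration, and a real MapReduce job would distribute the two phases across machines; the logic of "filter in the map phase, aggregate in the reduce phase" is the same.

```python
from functools import reduce

# Invented sample records standing in for a recording artists database.
artists = [
    {"name": "Fred Astaire", "nationality": "American"},
    {"name": "Ginger Rogers", "nationality": "American"},
    {"name": "The Beatles", "nationality": "English"},
]

# Map phase: emit a 1 for each American artist, discarding the rest.
mapped = [1 for artist in artists if artist["nationality"] == "American"]

# Reduce phase: aggregate the emitted values into a single count.
american_count = reduce(lambda acc, value: acc + value, mapped, 0)
```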

Even with a lot of machines and a fast network infrastructure, MapReduce can be quite latent. Normally, we’d use the features of the data store to provide a more focused dataset — perhaps using indexes or other ad hoc queries — and then MapReduce that smaller dataset to arrive at our answer.

Aggregate stores are not built to deal with highly connected data. We can try to use them for that purpose, but we have to add code to fill in where the underlying data model leaves off, resulting in a development experience that is far from seamless, and operational characteristics that are generally speaking not very fast, particularly as the number of hops (or “degree” of the query) increases. Aggregate stores may be good at storing data that’s big, but they aren’t great at dealing with problems that require an understanding of how things are connected.

Graph Databases

A graph database is an online, operational database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. Graph databases are generally built for use with transactional (OLTP) systems. Accordingly, they are normally optimized for transactional performance, and engineered with transactional integrity and operational availability in mind.

Two properties of graph databases are useful to understand when investigating graph database technologies:

The underlying storage
Some graph databases use native graph storage, which is optimized and designed for storing and managing graphs. Not all graph database technologies use native graph storage, however. Some serialize the graph data into a relational database, object-oriented database, or other types of NOSQL stores.
The processing engine
Some definitions of graph databases require that they be capable of index-free adjacency, meaning that connected nodes physically “point” to each other in the database.2 Here we take a slightly broader view. Any database that from the user’s perspective behaves like a graph database (i.e., exposes a graph data model through CRUD operations), qualifies as a graph database. We do acknowledge, however, the significant performance advantages of index-free adjacency, and therefore use the term native graph processing in reference to graph databases that leverage index-free adjacency.

Graph databases — in particular native ones — don’t depend heavily on indexes because the graph itself provides a natural adjacency index. In a native graph database, the relationships attached to a node naturally provide a direct connection to other related nodes of interest. Graph queries use this locality to traverse through the graph by chasing pointers. These operations can be carried out with extreme efficiency, traversing millions of nodes per second, in contrast to joining data through a global index, which is many orders of magnitude slower.
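Index-free adjacency can be illustrated with a toy in-memory structure in which each node holds direct references to its neighbors, so a traversal follows stored pointers rather than performing a global index lookup per hop. This is a sketch of the idea only (the `Node` class and `friends_of_friends` helper are invented here), not Neo4j's actual storage format.

```python
class Node:
    """A node holding direct references ("pointers") to its neighbors."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []

    def connect(self, other):
        self.neighbors.append(other)

def friends_of_friends(start):
    """A two-hop traversal done purely by chasing stored references."""
    found = set()
    for friend in start.neighbors:
        for fof in friend.neighbors:
            if fof is not start:
                found.add(fof.name)
    return found

alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.connect(bob)
bob.connect(carol)
```

Each hop here is a constant-time pointer dereference; the equivalent two-hop query over a global index would pay an index lookup per hop, which is where the orders-of-magnitude difference described above comes from.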

Besides adopting a specific approach to storage and processing, a graph database will also adopt a specific data model. There are several different graph data models in common usage, including property graphs, hypergraphs, and triples. We discuss each of these models below.

Property Graphs

A property graph has the following characteristics:

  • It contains nodes and relationships.
  • Nodes contain properties (key-value pairs).
  • Nodes can be labeled with one or more labels.
  • Relationships are named and directed, and always have a start and end node.
  • Relationships can also contain properties.
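The characteristics above can be captured in a minimal data-structure sketch. The dictionaries and the `outgoing` helper are illustrative, not any graph database's API; they exist only to make the labeled-node, named-directed-relationship shape concrete.

```python
# Labeled nodes carrying key-value properties.
nodes = {
    1: {"labels": ["Person"], "properties": {"name": "Alice"}},
    2: {"labels": ["Person"], "properties": {"name": "Bob"}},
}

# Relationships are named, directed (start -> end), and may carry
# their own properties.
relationships = [
    {"type": "KNOWS", "start": 1, "end": 2, "properties": {"since": 2015}},
]

def outgoing(node_id, rel_type):
    """Return the end nodes of a node's outgoing relationships of one type."""
    return [nodes[rel["end"]] for rel in relationships
            if rel["start"] == node_id and rel["type"] == rel_type]
```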

Hypergraphs

A hypergraph is a generalized graph model in which a relationship (called a hyper-edge) can connect any number of nodes. Whereas the property graph model permits a relationship to have only one start node and one end node, the hypergraph model allows any number of nodes at either end of a relationship. Hypergraphs can be useful where the domain consists mainly of many-to-many relationships. For example, in Figure A-7 we see that Alice and Bob are the owners of three vehicles. We express this using a single hyper-edge, whereas in a property graph we would use six relationships.

grdb aa07
Figure A-7. A simple (directed) hypergraph

As we discussed in Chapter 3, graphs enable us to model our problem domain in a way that is easy to visualize and understand, and which captures with high fidelity the many nuances of the data we encounter in the real world. Although in theory hypergraphs produce accurate, information-rich models, in practice it’s very easy for us to miss some detail while modeling. To illustrate this point, let’s consider the graph shown in Figure A-8, which is the property graph equivalent of the hypergraph shown in Figure A-7.

The property graph shown here requires several OWNS relationships to express what the hypergraph captured with just one. But in using several relationships, not only are we able to use a familiar and very explicit modeling technique, but we’re also able to fine-tune the model. For example, we’ve identified the “primary driver” for each vehicle (for insurance purposes) by adding a property to the relevant relationships — something that can’t be done with a single hyper-edge.

grdb aa08
Figure A-8. A property graph is semantically fine-tuned

Note

Because hyper-edges are multidimensional, hypergraphs comprise a more general model than property graphs. That said, the two models are isomorphic. It is always possible to represent the information in a hypergraph as a property graph (albeit using more relationships and intermediary nodes). Whether a hypergraph or a property graph is best for you is going to depend on your modeling mindset and the kinds of applications you’re building. Anecdotally, for most purposes property graphs are widely considered to have the best balance of pragmatism and modeling efficiency — hence their overwhelming popularity in the graph database space. However, in situations where you need to capture meta-intent, effectively qualifying one relationship with another (e.g., I like the fact that you liked that car), hypergraphs typically require fewer primitives than property graphs.


Triples

Triple stores come from the Semantic Web movement, where researchers are interested in large-scale knowledge inference by adding semantic markup to the links that connect web resources. To date, very little of the Web has been marked up in a useful fashion, so running queries across the semantic layer is uncommon. Instead, most effort in the Semantic Web appears to be invested in harvesting useful data and relationship information from the Web (or other more mundane data sources, such as applications) and depositing it in triple stores for querying.

A triple is a subject-predicate-object data structure. Using triples, we can capture facts, such as “Ginger dances with Fred” and “Fred likes ice cream.” Individually, single triples are semantically rather poor, but en masse they provide a rich dataset from which to harvest knowledge and infer connections. Triple stores typically provide SPARQL capabilities to reason about their stored RDF data.
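A toy triple store holding the two facts from the text makes the subject-predicate-object shape concrete. The `match` function sketches SPARQL-style pattern matching, with `None` as a wildcard; it is an invented helper, not a real SPARQL engine.

```python
# Each fact is a (subject, predicate, object) tuple.
triples = [
    ("Ginger", "dancesWith", "Fred"),
    ("Fred", "likes", "ice cream"),
]

def match(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern; None matches anything."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]
```

Individually each tuple says little, but patterns over the whole set start to yield connections: asking for every triple with subject `"Fred"` surfaces all the facts known about Fred.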

RDF — the lingua franca of triple stores and the Semantic Web — can be serialized several ways. The following snippet shows how triples come together to form linked data, using the RDF/XML format:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.example.org/terms/">
	<rdf:Description rdf:about="http://www.example.org/ginger">
		<name>Ginger Rogers</name>
		<occupation>dancer</occupation>
		<partner rdf:resource="http://www.example.org/fred"/>
	</rdf:Description>
	<rdf:Description rdf:about="http://www.example.org/fred">
		<name>Fred Astaire</name>
		<occupation>dancer</occupation>
		<likes rdf:resource="http://www.example.org/ice-cream"/>
	</rdf:Description>
</rdf:RDF>

Triple stores fall under the general category of graph databases because they deal in data that — once processed — tends to be logically linked. They are not, however, “native” graph databases, because they do not support index-free adjacency, nor are their storage engines optimized for storing property graphs. Triple stores store triples as independent artifacts, which allows them to scale horizontally for storage, but precludes them from rapidly traversing relationships. To perform graph queries, triple stores must create connected structures from independent facts, which adds latency to each query. For these reasons, the sweet spot for a triple store is analytics, where latency is a secondary consideration, rather than OLTP (responsive, online transaction processing systems).


Note

Although graph databases are designed predominantly for traversal performance and executing graph algorithms, it is possible to use them as a backing store behind an RDF/SPARQL endpoint. For example, the Blueprints SAIL API provides an RDF interface to several graph databases, including Neo4j. In practice this implies a level of functional isomorphism between graph databases and triple stores. However, each store type is suited to a different kind of workload, with graph databases being optimized for graph workloads and rapid traversals.


1 The .NET-based RavenDB has bucked the trend among aggregate stores in supporting ACID transactions. As we show elsewhere in the book, ACID properties are still upheld by many graph databases.

2 See Rodriguez, Marko A., and Peter Neubauer. 2011. “The Graph Traversal Pattern.” In Graph Data Management: Techniques and Applications, ed. Sherif Sakr and Eric Pardede, 29-46. Hershey, PA: IGI Global.

About the Authors

Ian Robinson is the co-author of REST in Practice (O’Reilly, 2010). Ian is an engineer at Neo Technology, working on a distributed version of the Neo4j database. Prior to joining the engineering team, Ian served as Neo’s Director of Customer Success, managing the training, professional services, and support arms of Neo, and working with customers to design and develop mission-critical graph database solutions. Ian came to Neo Technology from ThoughtWorks, where he was SOA Practice Lead and a member of the CTO’s global Technical Advisory Board. Ian presents frequently at conferences worldwide on topics including the application of graph database technologies and RESTful enterprise integration.

Dr. Jim Webber is Chief Scientist with Neo Technology where he researches novel graph databases and writes open source software. Previously, Jim spent time working with big graphs like the Web for building distributed systems, which led him to being co-author on the book REST in Practice, having previously written Developing Enterprise Web Services: An Architect’s Guide (Prentice Hall, 2003). Jim is active in the development community, presenting regularly around the world. His blog is located at http://jimwebber.org and he tweets often as @jimwebber.

Emil Eifrem is CEO of Neo Technology and co-founder of the Neo4j project. Before founding Neo, he was the CTO of Windh AB, where he headed the development of highly complex information architectures for Enterprise Content Management Systems. Committed to sustainable open source, he guides Neo along a balanced path between free availability and commercial reliability. Emil is a frequent conference speaker and author on NOSQL databases.

Colophon

The animal on the cover of Graph Databases is a European octopus (Eledone cirrhosa), also known as a lesser octopus or horned octopus. The European octopus is native to the rocky coasts of Ireland and England, but can also be found in the Atlantic Ocean, North Sea, and Mediterranean Sea. It mainly resides in depths of 10 to 15 meters, but has been noted as far down as 800 meters. Its identifying features include its reddish-orange color, white underside, granulations on its skin, and ovoid mantle.

The European octopus primarily eats crabs and other crustaceans. Many fisheries in the Mediterranean and North Seas often unintentionally catch the European octopus. The species is not subject to stock assessment or quota control, so they can be consumed. However, their population has increased in these areas in recent years, due in part to the overfishing of larger predatory fish.

The European octopus can grow to be between 12 and 40 centimeters long, which it reaches in about one year. It has a relatively short life span of less than five years. Compared to the octopus vulgaris (or common octopus), the European octopus breeds at a much lower rate, laying on average 1,000 to 5,000 eggs.

Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from Dover Pictorial Archive. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.